SPECSIA teaser
Abstract

Generating animation from a single 2D drawing is challenging because the output must preserve character appearance while remaining plausible and temporally coherent under motion. Existing drawing-based 3D animation pipelines often use sample-wise 2D refinement to align animated renderings with the input image, but such optimization tends to overfit to the observed view and fails to correct projection-induced artifacts in novel views. To address this limitation, we introduce SPECSIA-15K, a paired stylization dataset for novel-view enhancement in drawing-based 3D animation, constructed from multi-view renderings of 3DBiCar with 10 views per object and 14,980 artifact-corrupted projection/refinement-target pairs. We further present Drawing-based View Enhancement (DraViE), a lightweight plug-and-play refinement module trained with data-level priors to remove novel-view artifacts while preserving drawing style and motion plausibility. Experiments show consistent gains in novel-view fidelity and temporal coherence with lower per-character adaptation cost than sample-wise fine-tuning.

Motivation

Qualitative analysis on novel-view

(a) Artifacts in novel-view renderings — missing contours and speckle-like regions (red boxes).

CLIP vs FID plot

(b) CLIP–FID joint plot: blue (frontal) vs. red (novel) — quality consistently drops in novel views.

Existing drawing-based 3D animation pipelines produce systematic artifacts in novel views. Lines connect frontal/novel-view scores for each method, revealing a consistent quality drop.

Method

Sample-wise vs. Data-level Alignment

Comparison of pipelines

(a) Prior sample-wise stylization networks optimize per character and overfit to the observed view. (b) Our approach trains a data-level prior on SPECSIA-15K, enabling generalization to novel views.



SPECSIA-15K Dataset

Paired dataset for 2D projection refinement in drawing-based animation.

  • 1,498 3DBiCar characters
  • 10 uniformly-sampled viewpoints/char
  • 14,980 paired samples total
  • Triplet input: (Z, Zmask, Zpos) + GT Y*
  • 512 × 512 RGBA PNG

DraViE Pipeline

Plug-and-play 2D refinement module trained with data-level priors.

1. Pre-train on SPECSIA-15K with a patch-wise (32×32) L₁ + adversarial + VGG loss
2. Lightweight adaptation (optional): single-epoch fine-tune on the input character
3. Inference: fully convolutional; processes full-resolution frames

Dataset Construction

SPECSIA-15K dataset construction

For each character: render 10 views as artifact-free GT Y* → reconstruct 3D from frontal view → project to same viewpoints as artifact-prone Z. Each sample provides a triplet (Z, Zmask, Zpos) paired with Y*.
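The pairing scheme above can be sketched as a sample index. This is a minimal illustration, not the released tooling: the `SampleKey` type, the azimuth-only camera sweep, and the convention that view 0 is the frontal view are all assumptions introduced here; only the counts (1,498 characters, 10 views each) come from the dataset description.

```python
from dataclasses import dataclass

NUM_CHARACTERS = 1498       # from the dataset summary
VIEWS_PER_CHARACTER = 10    # from the dataset summary


@dataclass(frozen=True)
class SampleKey:
    """Hypothetical identifier for one (Z, Zmask, Zpos) / Y* pair."""
    character_id: int
    view_index: int
    azimuth_deg: float  # assumption: views differ only in azimuth


def uniform_azimuths(num_views: int) -> list[float]:
    """Evenly spaced azimuths in [0, 360); index 0 is assumed to be frontal."""
    return [360.0 * i / num_views for i in range(num_views)]


def build_index(num_characters: int = NUM_CHARACTERS,
                views: int = VIEWS_PER_CHARACTER) -> list[SampleKey]:
    """Enumerate one key per character and viewpoint."""
    azimuths = uniform_azimuths(views)
    return [SampleKey(c, v, a)
            for c in range(num_characters)
            for v, a in enumerate(azimuths)]


index = build_index()
print(len(index))  # 14980
```

Each key would map to one artifact-prone projection Z (with its mask and positional hint) and the artifact-free render Y* from the same viewpoint.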


Pre-training with Patch-wise Objective

DraViE pre-training

Patch centers are sampled from the foreground mask. Aligned 32×32 patches from the input triplet and Y* are used for patch-wise supervision. At inference, the fully convolutional network processes full-resolution frames directly.
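A minimal NumPy sketch of the patch sampling described above. The function name, the number of patches per image, and the border handling (clamping the window so it stays inside the image) are assumptions made for illustration; the source only states that centers come from the foreground mask and that patches are aligned and 32×32.

```python
import numpy as np


def sample_aligned_patches(z, z_mask, z_pos, y_star,
                           num_patches=8, patch=32, rng=None):
    """Crop aligned patch x patch windows centered on random foreground pixels.

    z, z_pos, y_star: arrays of shape (H, W, C); z_mask: (H, W), nonzero = foreground.
    Returns a list of (z, z_mask, z_pos, y_star) patch tuples.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = z_mask.shape
    half = patch // 2
    ys, xs = np.nonzero(z_mask)
    if ys.size == 0:
        raise ValueError("foreground mask is empty")
    picks = rng.integers(0, ys.size, size=num_patches)
    patches = []
    for cy, cx in zip(ys[picks], xs[picks]):
        # Assumed border handling: shift the window so it stays inside the image.
        top = int(np.clip(cy - half, 0, h - patch))
        left = int(np.clip(cx - half, 0, w - patch))
        win = (slice(top, top + patch), slice(left, left + patch))
        # The same window is applied to every tensor, so the patches stay aligned.
        patches.append((z[win], z_mask[win], z_pos[win], y_star[win]))
    return patches
```

Because every tensor is cropped with the same window, the loss on each patch compares corresponding pixels of the refined projection and Y*.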

DraViE Inference Pipeline

DraViE inference pipeline

(a) Pre-training on SPECSIA-15K. (b) Optional lightweight adaptation to the input character. (c) Plug-and-play post-correction on novel-view projections from any upstream 3D animation system.
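The plug-and-play stage (c) amounts to a per-frame call on whatever projections the upstream system produces. The sketch below uses a 3×3 mean filter as a stand-in for the trained network, purely to show the interface and the resolution independence of a convolutional operator; `box_filter_refiner` and `refine_sequence` are hypothetical names and this is not the DraViE architecture.

```python
import numpy as np


def box_filter_refiner(z, z_mask, z_pos):
    """Stand-in for the trained refiner (NOT DraViE): 3x3 mean filter on foreground.

    z: (H, W, C) projection; z_mask: (H, W) foreground mask;
    z_pos: positional hint, unused by this stand-in.
    """
    h, w, _ = z.shape
    padded = np.pad(z.astype(np.float64), ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros(z.shape, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    out /= 9.0
    foreground = (z_mask > 0)[..., None]
    # Background pixels are passed through unchanged.
    return np.where(foreground, out, z)


def refine_sequence(frames, refiner=box_filter_refiner):
    """Apply the refiner to each (z, z_mask, z_pos) frame independently."""
    return [refiner(z, m, p) for z, m, p in frames]
```

The same function accepts a 32×32 patch or a 512×512 frame without modification, which is the property that lets a patch-trained convolutional model run on full-resolution frames. Frames are processed with no temporal state, matching the frame-independent design noted under Limitations.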

Experimental Results

Table 1 — Quantitative Evaluation

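| Method | Frontal CLIP↑ | Frontal SSIM↑ | Frontal LPIPS↓ | Frontal FID↓ | Novel ⭑ CLIP↑ | Novel ⭑ SSIM↑ | Novel ⭑ LPIPS↓ | Novel ⭑ FID↓ |
|---|---|---|---|---|---|---|---|---|
| DSU (DrawingSpinUp) | 0.902 | 0.840 | 0.212 | 205.19 | 0.847 | 0.833 | 0.250 | 280.83 |
| OSF | 0.905 | 0.841 | 0.211 | 202.24 | 0.850 | 0.835 | 0.249 | 270.21 |
| DraViE (Ours) | 0.905 | 0.846 | 0.208 | 201.31 | 0.859 | 0.840 | 0.245 | 206.13 |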

⭑ Novel-view gains are the primary contribution. Metrics averaged over 10 runs.
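To make the novel-view gap concrete, the short script below recomputes two derived quantities from the Table 1 scores: the frontal-to-novel FID gap per method and the relative novel-view FID reduction of DraViE over each baseline. The helper names are illustrative; the numbers are copied from the table.

```python
# Scores copied from Table 1 as (CLIP, SSIM, LPIPS, FID) per view type.
TABLE1 = {
    "DSU": {"frontal": (0.902, 0.840, 0.212, 205.19),
            "novel": (0.847, 0.833, 0.250, 280.83)},
    "OSF": {"frontal": (0.905, 0.841, 0.211, 202.24),
            "novel": (0.850, 0.835, 0.249, 270.21)},
    "DraViE": {"frontal": (0.905, 0.846, 0.208, 201.31),
               "novel": (0.859, 0.840, 0.245, 206.13)},
}
FID = 3  # index of FID in each score tuple


def fid_gap(method):
    """Novel-view FID minus frontal-view FID (smaller means less degradation)."""
    scores = TABLE1[method]
    return scores["novel"][FID] - scores["frontal"][FID]


def relative_fid_reduction(baseline, ours="DraViE"):
    """Fractional novel-view FID reduction of `ours` relative to `baseline`."""
    base = TABLE1[baseline]["novel"][FID]
    return (base - TABLE1[ours]["novel"][FID]) / base


for name in TABLE1:
    print(f"{name}: frontal-to-novel FID gap = {fid_gap(name):.2f}")
for name in ("DSU", "OSF"):
    print(f"DraViE vs {name}: {100 * relative_fid_reduction(name):.1f}% lower novel-view FID")
```

The gaps come out to about 75.6 (DSU), 68.0 (OSF), and 4.8 (DraViE), and the novel-view FID is roughly 26.6% and 23.7% lower than DSU and OSF respectively.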


Table 2 — Cross-pipeline Generalization

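| Upstream | DSU CLIP↑ | OSF CLIP↑ | Ours CLIP↑ | DSU FID↓ | OSF FID↓ | Ours FID↓ |
|---|---|---|---|---|---|---|
| Wonder3D | 0.874 | 0.877 | 0.882 | 243 | 236 | 203 |
| InstantMesh | 0.831 | 0.835 | 0.840 | 242 | 233 | 205 |
| CRM | 0.807 | 0.810 | 0.848 | 271 | 266 | 239 |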

Same DraViE model applied to three different upstream 3D reconstruction backends without retraining.


User Preference Study (30-person blinded A/B)

| Criterion | DraViE (Ours) | Prior methods |
|---|---|---|
| Artifact Reduction | 73.3% | 26.7% |
| Style Preservation | 60.0% | 40.0% |
| Overall Quality | 86.7% | 13.3% |
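With 30 participants, the reported preference percentages correspond to whole-number vote counts. The conversion below assumes each participant cast exactly one vote per criterion, which the page does not state explicitly; the percentages are the DraViE shares reported above.

```python
PARTICIPANTS = 30  # from the study description

# DraViE preference share per criterion, in percent.
PREFERENCE_PCT = {
    "Artifact Reduction": 73.3,
    "Style Preservation": 60.0,
    "Overall Quality": 86.7,
}


def implied_votes(pct, n=PARTICIPANTS):
    """Nearest whole number of votes matching a percentage (one vote per person assumed)."""
    return round(pct / 100 * n)


for criterion, pct in PREFERENCE_PCT.items():
    votes = implied_votes(pct)
    print(f"{criterion}: {votes}/{PARTICIPANTS} preferred DraViE")
```

Under that assumption the shares map to 22, 18, and 26 of 30 votes, and each count rounds back to the reported percentage.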

Qualitative Comparison

Qualitative comparison

DraViE vs. DrawingSpinUp (DSU) and OSF across diverse characters and motions. Red boxes highlight corrected projection artifacts.

Video Results

Comparison Set 1

Drawing-based 3D animation results — DraViE reduces speckle artifacts and preserves drawing style across diverse motions.


Comparison Set 2

DraViE preserves fine-grained appearance details (mouth shape, clothing logos, hair color, stroke patterns) under viewpoint changes.


Failure Cases

Representative failure cases. DraViE may hallucinate plausible structures (e.g., leg separation) when upstream reconstruction topology is incorrect.

Ablation Study

Table 3 — Effect of Lightweight Adaptation and Positional Hint

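| Setting | Frontal CLIP↑ | Frontal SSIM↑ | Frontal LPIPS↓ | Frontal FID↓ | Novel CLIP↑ | Novel SSIM↑ | Novel LPIPS↓ | Novel FID↓ |
|---|---|---|---|---|---|---|---|---|
| w/o Lightweight Adaptation | 0.900 | 0.840 | 0.212 | 207.15 | 0.838 | 0.834 | 0.250 | 221.48 |
| w/ Lightweight Adaptation | 0.905 | 0.846 | 0.208 | 201.31 | 0.859 | 0.840 | 0.245 | 206.13 |
| w/o Positional Hint | 0.900 | 0.845 | 0.212 | 205.81 | 0.844 | 0.839 | 0.250 | 318.39 |
| w/ Positional Hint (full) | 0.905 | 0.846 | 0.208 | 201.31 | 0.859 | 0.840 | 0.245 | 206.13 |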

Removing the positional hint causes a large novel-view FID increase (206 → 318), confirming its role in location-aware correction.
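The per-metric effect of each ablation on novel views can be read off as a difference against the full model. The snippet below is only arithmetic on the Table 3 novel-view scores; the dictionary layout and function name are illustrative.

```python
METRICS = ("CLIP", "SSIM", "LPIPS", "FID")

# Novel-view scores copied from Table 3.
NOVEL = {
    "full": (0.859, 0.840, 0.245, 206.13),
    "w/o Lightweight Adaptation": (0.838, 0.834, 0.250, 221.48),
    "w/o Positional Hint": (0.844, 0.839, 0.250, 318.39),
}


def delta_vs_full(setting):
    """Ablated score minus full-model score, per metric."""
    full = NOVEL["full"]
    return {m: round(a - f, 3) for m, a, f in zip(METRICS, NOVEL[setting], full)}


for setting in ("w/o Lightweight Adaptation", "w/o Positional Hint"):
    print(setting, delta_vs_full(setting))
```

Dropping adaptation costs about 0.021 CLIP and 15.35 FID, while dropping the positional hint costs about 0.015 CLIP and 112.26 FID, so the hint accounts for most of the FID sensitivity.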


Effect of Lightweight Adaptation

Lightweight adaptation ablation

Without adaptation, the pre-trained prior removes artifacts but deviates from the input drawing appearance (e.g., mismatched colors, over-smoothed strokes). With adaptation, input-specific style is preserved.


Effect of Positional Hint

Positional hint ablation

Without the positional hint, the model applies conservative smoothing and fails to recover fine structures (e.g., mouth) or remove speckling artifacts in specific regions (e.g., hair, legs).

Limitations

Failure cases

(a) Fused legs. (b) Hallucinated gap.

DraViE is a modular 2D post-correction component and cannot fully recover from severe upstream failures such as incorrect topology, missing large regions, or erroneous rigging.

When the reconstructed mesh merges the legs into a single region, DraViE cannot reliably reconstruct the correct separation. In some cases the learned prior may hallucinate a leg gap — favoring a plausible silhouette over the true input geometry.

SPECSIA-15K may encode systematic biases from the Wonder3D reconstruction backend and 3DBiCar source assets. Additionally, DraViE processes frames independently without explicit temporal attention, so rapid rotations or high-frequency details already absent in the projection may still cause flickering.

SPECSIA-15K Dataset

  • 14,980 paired samples
  • 1,498 3DBiCar characters
  • 10 viewpoints/character
⚠️ License notice: SPECSIA-15K contains only rendered pairs permitted for redistribution. Raw 3DBiCar 3D assets and Mixamo motion files are not included and must be obtained separately under their respective licenses.
| Split | Characters | Samples |
|---|---|---|
| Train | 1,298 | 12,980 |
| Val | 100 | 1,000 |
| Test | 100 | 1,000 |
| Total | 1,498 | 14,980 |
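The split sizes are consistent with partitioning by character, so that all 10 views of a character land in the same split. A minimal sketch of such a split follows; the seeded random shuffle is an assumption for illustration, and the actual assignment of characters to splits should be taken from the released dataset.

```python
import random

VIEWS_PER_CHARACTER = 10


def split_characters(num_characters=1498, n_val=100, n_test=100, seed=0):
    """Partition character IDs into train/val/test (shuffle and seed are assumptions)."""
    ids = list(range(num_characters))
    random.Random(seed).shuffle(ids)
    return {
        "train": ids[n_val + n_test:],
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
    }


splits = split_characters()
for name, chars in splits.items():
    # Sample count is characters times views, since every view of a character is kept together.
    print(name, len(chars), len(chars) * VIEWS_PER_CHARACTER)
```

Splitting at the character level keeps views of a test character out of training, which matters when the evaluation targets novel-view generalization.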

📦 Download on Hugging Face

BibTeX

TBA