SPECSIA: Stylization Dataset for Novel-View Enhancement in Drawing-based 3D Animation

Abstract

Generating animation from a single 2D drawing is challenging because the output must preserve character appearance while remaining plausible and temporally coherent under motion. Existing drawing-based 3D animation pipelines often use sample-wise 2D refinement to align animated renderings with the input image, but such optimization tends to overfit to the observed view and fails to correct projection-induced artifacts in novel views. To address this limitation, we introduce SPECSIA-15K, a paired stylization dataset for novel-view enhancement in drawing-based 3D animation, constructed from multi-view renderings of 3DBiCar with 10 views per object and 14,980 artifact-corrupted projection/refinement-target pairs. We further present Drawing-based View Enhancement (DraViE), a lightweight plug-and-play refinement module trained with data-level priors to remove novel-view artifacts while preserving drawing style and motion plausibility. Experiments show consistent gains in novel-view fidelity and temporal coherence with lower per-character adaptation cost than sample-wise fine-tuning.

Motivation

(a) Artifacts in novel-view renderings — missing contours and speckle-like regions (red boxes).

(b) CLIP–FID joint plot: blue (frontal) vs. red (novel) — quality consistently drops in novel views.

Existing drawing-based 3D animation pipelines produce systematic artifacts in novel views. Lines connect frontal/novel-view scores for each method, revealing a consistent quality drop.

Method

Sample-wise vs. Data-level Alignment

(a) Prior sample-wise stylization networks optimize per-character, overfitting to the observed view. (b) Our approach trains a data-level prior on SPECSIA-15K, enabling generalization to novel views.

🗂️

SPECSIA-15K Dataset

Paired dataset for 2D projection refinement in drawing-based animation.

1,498 3DBiCar characters
10 uniformly-sampled viewpoints/char
14,980 paired samples total
Triplet input: (Z, Z_mask, Z_pos) + GT Y*
512 × 512 RGBA PNG
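As a reading aid, one paired sample can be pictured as a small record. This is a minimal sketch, not the released loader: the class name `SpecsiaSample`, the field names, and the array layouts for `z_mask` and `z_pos` are assumptions. Only the 512 × 512 resolution and the RGBA format come from the specification above.

```python
from dataclasses import dataclass

import numpy as np

SIZE = 512  # image resolution stated for SPECSIA-15K


@dataclass
class SpecsiaSample:
    """One paired sample; field names and array layouts are illustrative."""

    z: np.ndarray       # artifact-prone projection Z, RGBA, (512, 512, 4)
    z_mask: np.ndarray  # foreground mask Z_mask, assumed (512, 512)
    z_pos: np.ndarray   # positional hint Z_pos, assumed (512, 512, C)
    y_star: np.ndarray  # artifact-free target Y*, RGBA, (512, 512, 4)

    def validate(self) -> None:
        # All four arrays must share the 512 x 512 spatial grid.
        for name in ("z", "z_mask", "z_pos", "y_star"):
            arr = getattr(self, name)
            if arr.shape[:2] != (SIZE, SIZE):
                raise ValueError(f"{name} has spatial shape {arr.shape[:2]}")
        # The two images are RGBA, so they need exactly four channels.
        for name in ("z", "y_star"):
            if getattr(self, name).shape[2:] != (4,):
                raise ValueError(f"{name} must have 4 channels (RGBA)")


sample = SpecsiaSample(
    z=np.zeros((SIZE, SIZE, 4), dtype=np.uint8),
    z_mask=np.zeros((SIZE, SIZE), dtype=bool),
    z_pos=np.zeros((SIZE, SIZE, 2), dtype=np.float32),
    y_star=np.zeros((SIZE, SIZE, 4), dtype=np.uint8),
)
sample.validate()  # passes silently when shapes are consistent
```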

🔧

DraViE Pipeline

Plug-and-play 2D refinement module trained with data-level priors.

1. Pre-train on SPECSIA-15K with a patch-wise (32×32) L₁ + adversarial + VGG loss

2. Lightweight adaptation (optional): single-epoch fine-tune on the input character

3. Inference: fully convolutional; processes full-resolution frames

Dataset Construction

For each character, we render 10 views as the artifact-free ground truth Y*, reconstruct a 3D model from the frontal view only, and project that reconstruction to the same 10 viewpoints to obtain the artifact-prone inputs Z. Each sample provides a triplet (Z, Z_mask, Z_pos) paired with Y*.
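The pair count follows directly from this construction. The sketch below enumerates (character, viewpoint) pairs. It assumes the 10 viewpoints are azimuth angles spaced evenly over 360° with the frontal view at 0°; the page only says the viewpoints are uniformly sampled, so the exact camera layout and the function names are illustrative.

```python
from typing import Iterable, List, Tuple


def uniform_azimuths(n_views: int = 10) -> List[float]:
    """Evenly spaced azimuths in degrees (assumed layout, frontal view at 0)."""
    return [i * 360.0 / n_views for i in range(n_views)]


def enumerate_pairs(character_ids: Iterable[int],
                    n_views: int = 10) -> List[Tuple[int, float]]:
    """One (character, azimuth) entry per rendered Z / Y* pair."""
    azimuths = uniform_azimuths(n_views)
    return [(cid, az) for cid in character_ids for az in azimuths]


pairs = enumerate_pairs(range(1498))
print(len(pairs))  # 1,498 characters x 10 views = 14980
```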

Pre-training with Patch-wise Objective

Patch centers are sampled from the foreground mask. Aligned 32×32 patches from the input triplet and Y* are used for patch-wise supervision. At inference, the fully convolutional network processes full-resolution frames directly.
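The patch sampling described above can be sketched in NumPy. This is an illustration under stated assumptions rather than the training code: the function name, the channel count of `z_pos`, and the clamping of windows at image borders are all assumptions.

```python
import numpy as np

PATCH = 32  # patch size stated for the pre-training objective


def sample_aligned_patches(z, z_mask, z_pos, y_star, rng, patch=PATCH):
    """Crop the same window from the input triplet and from Y*."""
    h, w = z_mask.shape
    half = patch // 2
    ys, xs = np.nonzero(z_mask)  # candidate centers: foreground pixels only
    if ys.size == 0:
        raise ValueError("foreground mask is empty")
    i = rng.integers(ys.size)
    # Clamp the window so it stays inside the image (assumed border policy);
    # the sampled foreground pixel always remains inside the window.
    top = int(np.clip(ys[i] - half, 0, h - patch))
    left = int(np.clip(xs[i] - half, 0, w - patch))
    window = (slice(top, top + patch), slice(left, left + patch))
    return z[window], z_mask[window], z_pos[window], y_star[window]


rng = np.random.default_rng(0)
z = np.zeros((512, 512, 4), dtype=np.float32)
mask = np.zeros((512, 512), dtype=bool)
mask[100:400, 150:350] = True  # toy foreground region
pos = np.zeros((512, 512, 2), dtype=np.float32)
y = np.zeros_like(z)
patches = sample_aligned_patches(z, mask, pos, y, rng)
```

Because every crop uses the same window, the input patches and the target patch stay pixel-aligned, which is what a patch-wise L₁ term requires.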

DraViE Inference Pipeline

(a) Pre-training on SPECSIA-15K. (b) Optional lightweight adaptation to the input character. (c) Plug-and-play post-correction on novel-view projections from any upstream 3D animation system.

Experimental Results

Table 1 — Quantitative Evaluation

Method	Frontal-view				Novel-view ⭑
Method	CLIP↑	SSIM↑	LPIPS↓	FID↓	CLIP↑	SSIM↑	LPIPS↓	FID↓
DSU (DrawingSpinUp)	0.902	0.840	0.212	205.19	0.847	0.833	0.250	280.83
OSF	0.905	0.841	0.211	202.24	0.850	0.835	0.249	270.21
DraViE (Ours)	0.905	0.846	0.208	201.31	0.859	0.840	0.245	206.13

⭑ Novel-view gains are the primary contribution. Metrics averaged over 10 runs.
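To make the frontal-to-novel degradation explicit, the short script below computes the FID gap (novel minus frontal) for each method from the Table 1 values. The variable names are illustrative; the numbers are copied from the table.

```python
# (frontal FID, novel-view FID) from Table 1
fid = {
    "DSU": (205.19, 280.83),
    "OSF": (202.24, 270.21),
    "DraViE": (201.31, 206.13),
}

# A smaller gap means quality holds up better away from the frontal view.
gaps = {name: round(novel - frontal, 2) for name, (frontal, novel) in fid.items()}

for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f} FID from frontal to novel view")
# DSU: +75.64, OSF: +67.97, DraViE: +4.82
```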

Table 2 — Cross-pipeline Generalization

Upstream	DSU CLIP↑	OSF CLIP↑	Ours CLIP↑	DSU FID↓	OSF FID↓	Ours FID↓
Wonder3D	.874	.877	.882	243	236	203
InstantMesh	.831	.835	.840	242	233	205
CRM	.807	.810	.848	271	266	239

Same DraViE model applied to three different upstream 3D reconstruction backends without retraining.

User Preference Study (30-person blinded A/B)

Criterion	DraViE (Ours)	Prior methods
Artifact Reduction	73.3%	26.7%
Style Preservation	60.0%	40.0%
Overall Quality	86.7%	13.3%
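The reported preference shares can be converted back to participant counts. The sketch below assumes that each of the 30 participants cast exactly one vote per criterion; the page does not describe the protocol in that detail, so treat the counts as an inference.

```python
N_PARTICIPANTS = 30  # from the study title

# Share of votes preferring DraViE, in percent, per criterion.
shares = {
    "Artifact Reduction": 73.3,
    "Style Preservation": 60.0,
    "Overall Quality": 86.7,
}

# Assumes one vote per participant per criterion.
votes = {name: round(N_PARTICIPANTS * pct / 100) for name, pct in shares.items()}

for name, count in votes.items():
    print(f"{name}: {count}/{N_PARTICIPANTS} preferred DraViE")
# Artifact Reduction: 22/30, Style Preservation: 18/30, Overall Quality: 26/30
```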

Qualitative Comparison

DraViE vs. DrawingSpinUp (DSU) and OSF across diverse characters and motions. Red boxes highlight corrected projection artifacts.

Video Results

Comparison Set 1

Drawing-based 3D animation results — DraViE reduces speckle artifacts and preserves drawing style across diverse motions.

Comparison Set 2

DraViE preserves fine-grained appearance details (mouth shape, clothing logos, hair color, stroke patterns) under viewpoint changes.

Failure Cases

Representative failure cases. DraViE may hallucinate plausible structures (e.g., leg separation) when upstream reconstruction topology is incorrect.

Ablation Study

Table 3 — Effect of Lightweight Adaptation and Positional Hint

Setting	Frontal-view				Novel-view
Setting	CLIP↑	SSIM↑	LPIPS↓	FID↓	CLIP↑	SSIM↑	LPIPS↓	FID↓
w/o Lightweight Adaptation	0.900	0.840	0.212	207.15	0.838	0.834	0.250	221.48
w/ Lightweight Adaptation	0.905	0.846	0.208	201.31	0.859	0.840	0.245	206.13
w/o Positional Hint	0.900	0.845	0.212	205.81	0.844	0.839	0.250	318.39
w/ Positional Hint (full)	0.905	0.846	0.208	201.31	0.859	0.840	0.245	206.13

Removing the positional hint causes a large novel-view FID increase (206 → 318), confirming its role in location-aware correction.

Effect of Lightweight Adaptation

Without adaptation, the pre-trained prior removes artifacts but deviates from the input drawing appearance (e.g., mismatched colors, over-smoothed strokes). With adaptation, input-specific style is preserved.

Effect of Positional Hint

Without the positional hint, the model applies conservative smoothing and fails to recover fine structures (e.g., mouth) or remove speckling artifacts in specific regions (e.g., hair, legs).

Limitations

(a) Fused legs. (b) Hallucinated gap.

DraViE is a modular 2D post-correction component and cannot fully recover from severe upstream failures such as incorrect topology, missing large regions, or erroneous rigging.

When the reconstructed mesh merges the legs into a single region, DraViE cannot reliably reconstruct the correct separation. In some cases the learned prior may hallucinate a leg gap — favoring a plausible silhouette over the true input geometry.

SPECSIA-15K may encode systematic biases from the Wonder3D reconstruction backend and 3DBiCar source assets. Additionally, DraViE processes frames independently without explicit temporal attention, so rapid rotations or high-frequency details already absent in the projection may still cause flickering.

SPECSIA-15K Dataset

🗂️ 14,980 paired samples

👤 1,498 3DBiCar characters

🔭 10 viewpoints/character

⚠️ License notice: SPECSIA-15K contains only rendered pairs permitted for redistribution. Raw 3DBiCar 3D assets and Mixamo motion files are not included and must be obtained separately under their respective licenses.

Split	Characters	Samples
Train	1,298	12,980
Val	100	1,000
Test	100	1,000
Total	1,498	14,980
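In the table, each split's sample count is 10 times its character count, which is consistent with splitting by character so that all views of one character fall in the same split. A minimal sketch of such a split follows; the shuffling, the seed, and the function name are assumptions, and the released split files should be treated as authoritative.

```python
import random
from typing import Dict, Iterable, List

VIEWS_PER_CHARACTER = 10


def split_characters(character_ids: Iterable[int], n_val: int = 100,
                     n_test: int = 100, seed: int = 0) -> Dict[str, List[int]]:
    """Character-level split: no character appears in more than one split."""
    ids = sorted(character_ids)
    random.Random(seed).shuffle(ids)  # illustrative; not the official assignment
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }


splits = split_characters(range(1498))
sample_counts = {name: len(ids) * VIEWS_PER_CHARACTER for name, ids in splits.items()}
print(sample_counts)  # {'val': 1000, 'test': 1000, 'train': 12980}
```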

📦 Download on Hugging Face

BibTeX

SPECSIA: Stylization Dataset for Novel-View Enhancement
in Drawing-based 3D Animation