KRAFTON AI / Research Preview

Raon-VisionEncoder
A Vision Encoder for Raon

The story of building a SigLIP2-class vision encoder from scratch using only open data.

Mission: Give Raon Eyes

This project began to give KRAFTON AI's Raon "eyes" that could understand the world. For a model to converse with the world beyond text, a powerful vision encoder that precisely understands images is essential. Our goal was to reproduce Google's SigLIP2[1], the current state-of-the-art vision encoder.

💡 Raon-VisionEncoder project goal: Train a SigLIP2-class vision encoder from scratch using only open data, and make the entire process public.

The Target Is SigLIP2, but Key Ingredients Are Missing

SigLIP2 is a model that demonstrates top-tier performance across a wide range of vision benchmarks including zero-shot classification, retrieval, segmentation, and grounding. Its training pipeline is well-organized into three phases: Phase 1 builds fundamentals with contrastive learning and LocCa[3] loss, Phase 2 learns dense features through self-supervised learning (SILC[8], TIPS[9]), and Phase 3 adapts to various resolutions with NaFlex[14].
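Phase 1's sigmoid contrastive objective treats every image-text pair as an independent binary classification, with positives on the diagonal of the similarity matrix and negatives everywhere else. A minimal NumPy sketch (the temperature and bias are learned in practice; they are shown here as fixed constants, and the function name is ours):

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid contrastive loss in the style of SigLIP.

    img_emb, txt_emb: (N, D) L2-normalized embeddings.
    t, b: temperature and bias (learned in practice, fixed here).
    Each image-text pair is an independent binary problem: matching
    pairs (the diagonal) are positives, all other pairs negatives.
    """
    logits = t * (img_emb @ txt_emb.T) + b        # (N, N) pairwise logits
    labels = 2.0 * np.eye(len(img_emb)) - 1.0     # +1 on diagonal, -1 elsewhere
    # -log sigmoid(z * logit) == log1p(exp(-z * logit)), summed over pairs
    return np.mean(np.sum(np.log1p(np.exp(-labels * logits)), axis=1))
```

Unlike the softmax-based InfoNCE loss, no batch-wide normalization is required, which is part of what makes this objective friendly to large distributed batches.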

The problem was that while we had the recipe, the key ingredients were missing.

Missing Ingredient 1: 10B-Scale Training Data

SigLIP2 is trained on Google's proprietary dataset, WebLI[2] — 10 billion image-text pairs that we had no access to whatsoever. We had to match this scale using only open data.

Missing Ingredient 2: Grounding Data

SigLIP2's LocCa decoder learns to indicate and describe specific regions in images using bounding boxes. To achieve this, SigLIP2 runs OWL-ViT[4], a large-scale object detector, across the entire training dataset to generate billions of pseudo annotations. Estimated at over 150 days on 8 H200 GPUs, this requires enormous compute resources.

Missing Ingredient 3: Training Code

While SigLIP2's paper describes the training recipe in detail, the training code itself is not publicly available. From the 3-phase training pipeline, to the combination of various loss functions, to NaFlex resolution adaptation — everything had to be implemented from scratch based on the paper's descriptions.

Preparing the Ingredients: Assembling a 10.7B Data Pool

The WebLI data used to train SigLIP2 has an undisclosed composition, making direct use difficult. As an alternative, we decided to follow the approach of MetaCLIP[5], which has a publicly available large-scale data acquisition and curation pipeline.

The data pool was built on LAION-400M (280M)[6] and DataComp-1B (1.4B)[7], and to fill the remaining gap, we directly processed Common Crawl web archives. We extracted image-text pairs from a total of 42 snapshots spanning 2017 to 2026, yielding approximately 9 billion images.

Downloading and processing 9 billion images could, by itself, consume tens of days if approached naively. To keep data acquisition from eating the entire project timeline, we solved numerous engineering challenges one by one, built an optimized pipeline, and ran 56 machines for approximately 10 days to acquire the roughly 9 billion additional images.

After assembling the full 10.7B image pool, we applied MetaCLIP balancing and CLIP score-based filtering to curate a high-quality training set from the pool.
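The balancing step loosely follows MetaCLIP's substring-match-and-cap procedure: captions are matched against a metadata vocabulary, and over-represented entries are subsampled. A simplified Python sketch (the function name, the 20k cap, and the vocabulary handling are illustrative assumptions, not our actual pipeline code):

```python
import random
from collections import defaultdict

def metaclip_balance(samples, metadata, cap=20_000, seed=0):
    # Match each caption against a metadata vocabulary; subsample entries
    # whose match count exceeds `cap` so that head concepts do not
    # dominate the pool (tail concepts are kept in full).
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for s in samples:
        caption = s["caption"].lower()
        for entry in metadata:
            if entry in caption:
                buckets[entry].append(s)
    kept = {}
    for entry, matched in buckets.items():
        if len(matched) > cap:
            matched = rng.sample(matched, cap)
        for s in matched:
            kept[id(s)] = s  # dedupe samples matched by several entries
    return list(kept.values())
```

The cap flattens the long-tailed concept distribution of web data, which MetaCLIP identifies as a key ingredient behind CLIP-quality training sets.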

Data Pipeline Overview

Data pool (10.7B images total):
  • Common Crawl: ~9B (42 snapshots, 2017–2026)
  • DataComp-1B: 1.4B (open source)
  • LAION-400M: 280M (open source)

Curation steps: 🧠 CLIP embedding extraction → 🔍 deduplication → 🛡️ NSFW filtering → ⚖️ MetaCLIP balancing → 📊 CLIP ranking

Result: 10.7B data pool → 996M final training set

Grounding Data: Quality Over Quantity

Since we couldn't reproduce SigLIP2's billions of pseudo annotations, we changed our approach. We collected high-quality human-annotated data from open datasets such as OpenImages[16], Objects365[17], and Visual Genome[18], and added model-generated annotations from the GranD[15] dataset, combining a total of 17.2M grounding data samples. While this is an extremely small scale compared to SigLIP2's billions, our strategy was to compete on annotation quality.

Grounding Data · SigLIP2 vs Ours

SigLIP2 — ~10B pseudo annotations
  • Auto-generated from the entire training set using OWL-ViT
  • 150+ days on 8 H200 GPUs
  • Quality limited by detector performance

Raon-VisionEncoder — 17.2M high-quality open-source annotations
  • Primarily human-created ground-truth annotations
  • No additional GPU compute
  • High annotation quality

Beyond the Recipe: Our Additions

We didn't stop at faithfully reproducing SigLIP2's recipe — we introduced three key improvements.

3-Phase Training Pipeline

Phase 1 — Contrastive + LocCa: sigmoid contrastive learning; captioning + grounding
Phase 2 — Self-supervised + regularizers: self-distillation; mask prediction; Gram (new); KoLeo reg. (new)
Phase 3 — NaFlex resolution: variable-sequence training

Our modifications:
  • Modification 1 — Separate grounding batch: spatial learning decoupled from contrastive learning
  • Modification 2 — Gram + KoLeo: DINOv2/v3 regularization added
  • Modification 3 — Muon optimizer: ~2× sample efficiency vs AdamW

Separate Grounding Batch Construction

In SigLIP2, every training image has pseudo annotations, so the entire batch participates in all tasks. Since we use grounding data from separate sources, we introduced a mixed batch strategy where a portion of the batch is composed of grounding-only samples. Grounding samples only perform referring expression prediction and grounded captioning, and are excluded from contrastive learning.
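The routing above can be sketched as boolean masks over the batch; the 25% grounding fraction and the mask names below are illustrative assumptions, not our actual configuration:

```python
import numpy as np

def build_loss_masks(batch_size, grounding_frac=0.25, seed=0):
    # Mark a fixed fraction of the batch as grounding-only samples.
    rng = np.random.default_rng(seed)
    n_grounding = int(batch_size * grounding_frac)
    is_grounding = np.zeros(batch_size, dtype=bool)
    is_grounding[rng.choice(batch_size, n_grounding, replace=False)] = True
    # Route each sample to its losses: grounding samples feed only the
    # decoder objectives and are excluded from contrastive learning.
    return {
        "contrastive": ~is_grounding,          # web image-text pairs only
        "captioning": ~is_grounding,           # LocCa captioning
        "referring": is_grounding,             # referring expression prediction
        "grounded_captioning": is_grounding,
    }
```

Keeping grounding samples out of the contrastive loss matters because their region-level captions are not natural image-level descriptions, so treating them as contrastive positives would inject noisy pairs into the sigmoid objective.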

Gram Anchoring and KoLeo Regularization

SigLIP2's self-supervised learning objectives (SILC[8], TIPS[9]) are derived from the DINO family of methods, but they lack mechanisms to preserve global feature correlation structure or prevent mode collapse in the embedding space. To address this, we added DINOv3[11]'s Gram Anchoring (preserving the feature correlation structure learned in Phase 1) and DINOv2[10]'s KoLeo regularization[13] (maintaining uniform coverage of the embedding space) to Phase 2.
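Both regularizers are simple to state. A minimal NumPy sketch, assuming per-image patch features for Gram anchoring and a batch of pooled embeddings for KoLeo (the function names are ours, not from the papers' code):

```python
import numpy as np

def gram_anchor_loss(student, anchor):
    """Gram anchoring (DINOv3-style): match the patch-patch similarity
    (Gram) matrix of the current model to that of a frozen earlier
    checkpoint, preserving the feature correlation structure.
    student, anchor: (num_patches, dim) features for one image."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    return np.mean((s @ s.T - a @ a.T) ** 2)

def koleo_loss(emb, eps=1e-8):
    """KoLeo regularizer (DINOv2-style), based on the Kozachenko-Leonenko
    entropy estimator: penalize embeddings that collapse together by
    maximizing the log distance to each point's nearest neighbor."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # ignore self-distance
    return -np.mean(np.log(dist.min(axis=1) + eps))
```

In our setup the natural Gram anchor is the Phase 1 checkpoint, so Phase 2's dense self-supervision cannot drift away from the image-text alignment already learned.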

Muon Optimizer

We adopted the Muon optimizer[12] instead of SigLIP2's AdamW. Muon has been reported to achieve roughly 2× the sample efficiency of Adam, a natural fit for our setting of training with 10× less data. In ablation experiments, we confirmed consistent improvements over AdamW of +2.83 on ImageNet and +2.75 on grounding.
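At its core, Muon accumulates momentum and then approximately orthogonalizes each weight matrix's update with a quintic Newton-Schulz iteration. A minimal NumPy sketch (the coefficients follow the public Muon implementation; `muon_step` and its hyperparameters are illustrative, not our training configuration):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Quintic Newton-Schulz iteration used by Muon to approximately
    orthogonalize a weight matrix's (momentum-accumulated) gradient,
    pushing all singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)        # Frobenius normalization
    transposed = G.shape[0] > G.shape[1]      # iterate on the wide side
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update (sketch): accumulate momentum, orthogonalize the
    Nesterov-style update, apply. Returns new weights and momentum."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(beta * momentum + grad)
    return W - lr * update, momentum
```

Because the orthogonalized update has roughly uniform singular values, every direction of the weight matrix moves at a comparable rate, which is one intuition offered for Muon's sample-efficiency gains.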

SigLIP2 (Original)

  • Optimizer: AdamW
  • Phase 2 regularization: —
  • Grounding: OWL-ViT pseudo annotations (10B samples)
  • Training data: WebLI (proprietary, 10B samples)

Raon-VisionEncoder (Ours)

  • Optimizer: Muon (~2× efficiency)
  • Phase 2 regularization: Gram + KoLeo
  • Grounding: Open-source high-quality (17.2M samples)
  • Training data: Open-source (1B samples)

Results: Closing in on SigLIP2

Here are the results of our final model, trained for approximately 2 weeks on 80 B200 GPUs. Despite training with 10× less data using only open data, we achieved results that match or exceed SigLIP2 on several benchmarks.

Benchmark Results · Raon-VisionEncoder (open data) vs SigLIP2-NaFlex (proprietary data)
  • 🏆 Segmentation: exceeds SigLIP2 (+2.12 mIoU)
  • Grounding: still competitive despite a far smaller dataset (17.2M vs ~10B)
  • Retrieval: roughly on par
  • 🏆 EN → KO: English-only training matches a multilingual model on Korean VLM benchmarks
  • ImageNet classification: a ~5 %pt gap remains (data scale difference?)

Dense Prediction: Surpassing SigLIP2

Raon-VisionEncoder performed strongest on dense prediction tasks. It surpassed SigLIP2-NaFlex by +2.12 mIoU on Pascal VOC segmentation and by +2.70 δ₁ on depth estimation.

Grounding: 17.2M Samples Against Billions

In grounding, we achieved competitive performance with SigLIP2 (AVG 75.40 vs 75.78), outperforming on 4 of 8 RefCOCO splits. This demonstrates that 17.2M high-quality open-source grounding samples can compete with the OWL-ViT pseudo annotations generated from SigLIP2's entire 10B training images.

Classification and Retrieval: A Gap Remains, but a Meaningful Start

In zero-shot classification, a gap of approximately 5 points remains compared to SigLIP2. This reflects the absolute difference in training data (1B vs 10B samples) and the quality difference of the proprietary WebLI data — results well within the expected range. In retrieval, there is also an approximately 3 point gap on COCO I2T R@1, but on Flickr30K, we achieved near-parity with SigLIP2-NaFlex.

VLM Evaluation: English-Only Training Extends to Korean

We evaluated the vision encoder's VLM downstream performance under the LLaVA-OneVision-1.5 Quick-start protocol. On English benchmarks, it outperformed SigLIP2 on MMMU (45.89 vs 43.44), RealWorldQA (66.14 vs 64.97), CV-Bench (70.90 vs 70.74), and AI2D (75.00 vs 74.90), while nearly matching it on ChartQA (71.12 vs 71.48). The notable finding is on the Korean benchmarks: despite being trained exclusively on English data, it matched or edged out SigLIP2, which was trained on multilingual data, on KRETA (49.28 vs 48.04), K-VisCUIT (63.47 vs 62.40), and DTCBench (43.75 vs 43.75).

Looking Back, and Looking Ahead

Over the course of the project, we completed the entire process from infrastructure setup to data downloading and curation, ablation experiments, and final model training. This project delivers more than a single checkpoint. By releasing the complete blueprint — including data curation code, training configurations, and model checkpoints — we aim to enable anyone to reproduce a SigLIP2-class vision encoder and build upon it.

There are still challenges ahead. While currently trained on English-only data, the model could benefit from multilingual data even on English tasks, as shown in prior works[20]. Additionally, model-in-the-loop curation, where the model itself guides data selection during training, is another direction that could improve data efficiency.

We look forward to Raon-VisionEncoder becoming excellent eyes for Raon.

Citation

Jungseok Cho, Joo Young Choi†, Han EunGi, Hyunjin Kim†, Jaeah Lee‡, Hakyoung Lee, Seonho Lee, Suekyeong Nam, Soohwan Park†, Sungchan Park, Juhyeong Sun, Moonwon Yu

KRAFTON AI

† Project leads. ‡ Seoul National University. Authors are listed in alphabetical order.

@misc{raon-visionencoder2026, title = {Raon-VisionEncoder: A Vision Encoder for Raon}, author = {Jungseok Cho and Joo Young Choi and Han EunGi and Hyunjin Kim and Jaeah Lee and Hakyoung Lee and Seonho Lee and Suekyeong Nam and Soohwan Park and Sungchan Park and Juhyeong Sun and Moonwon Yu}, year = {2026}, note = {KRAFTON AI} }

References