Mission: Give Raon Eyes
This project began to give KRAFTON AI's Raon "eyes" that could understand the world. For a model to converse with the world beyond text, a powerful vision encoder that precisely understands images is essential. Our goal was to reproduce Google's SigLIP2[1], the current state-of-the-art vision encoder.
💡 Raon-VisionEncoder project goal: Train a SigLIP2-class vision encoder from scratch using only open data, and make the entire process public.
The Target Is SigLIP2, but Key Ingredients Are Missing
SigLIP2 is a model that demonstrates top-tier performance across a wide range of vision benchmarks including zero-shot classification, retrieval, segmentation, and grounding. Its training pipeline is well-organized into three phases: Phase 1 builds fundamentals with contrastive learning and LocCa[3] loss, Phase 2 learns dense features through self-supervised learning (SILC[8], TIPS[9]), and Phase 3 adapts to various resolutions with NaFlex[14].
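For concreteness, the core of Phase 1 is the pairwise sigmoid loss introduced by SigLIP and retained in SigLIP2: every image-text pair in the batch is scored independently, with no softmax normalization across the batch. The sketch below is a minimal numpy illustration; the `temperature` and `bias` defaults are illustrative constants, not our training hyperparameters.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss from SigLIP (the Phase 1 contrastive objective).

    Matching pairs (the diagonal) get label +1; all other pairs get -1.
    Each of the B^2 pairs contributes an independent binary-classification
    term, so the loss does not depend on batch-wide normalization.
    """
    # L2-normalize both embedding sets
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = temperature * img @ txt.T + bias     # (B, B) similarity logits
    labels = 2.0 * np.eye(len(img)) - 1.0         # +1 on diagonal, -1 elsewhere
    # -log sigmoid(label * logit) = log(1 + exp(-label * logit)), averaged
    return float(np.mean(np.log1p(np.exp(-labels * logits))))
```

Because each pair is scored independently, the loss behaves well across batch sizes, which is one reason SigLIP-style training scales cleanly.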
The problem was that while we had the recipe, the key ingredients were missing.
Missing Ingredient 1: 10B-Scale Training Data
SigLIP2 is trained on Google's proprietary dataset, WebLI[2] — 10 billion image-text pairs that we had no access to whatsoever. We had to match this scale using only open data.
Missing Ingredient 2: Grounding Data
SigLIP2's LocCa decoder learns to indicate and describe specific regions in images using bounding boxes. To achieve this, SigLIP2 runs OWL-ViT[4], a large-scale object detector, across the entire training dataset to generate billions of pseudo annotations. We estimated that this step alone would take over 150 days on 8 H200 GPUs, an enormous compute cost.
Missing Ingredient 3: Training Code
While SigLIP2's paper describes the training recipe in detail, the training code itself is not publicly available. From the 3-phase training pipeline, to the combination of various loss functions, to NaFlex resolution adaptation — everything had to be implemented from scratch based on the paper's descriptions.
Preparing the Ingredients: Assembling a 10.7B Data Pool
The WebLI data used to train SigLIP2 has an undisclosed composition, making direct use difficult. As an alternative, we decided to follow the approach of MetaCLIP[5], which has a publicly available large-scale data acquisition and curation pipeline.
The data pool was built on LAION-400M (280M)[6] and DataComp-1B (1.4B)[7], and to fill the remaining gap, we directly processed Common Crawl web archives. We extracted image-text pairs from a total of 42 snapshots spanning 2017 to 2026, yielding approximately 9 billion images.
Downloading and processing 9 billion images is a task that, approached naively, could consume tens of days by itself. To avoid spending the entire project timeline on data download, we worked through numerous engineering challenges one by one, built an optimized pipeline, and ran 56 machines for approximately 10 days to acquire the roughly 9 billion additional images.
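One small example of the kind of engineering choice involved: splitting billions of URLs across machines. A deterministic hash-based sharder lets each machine claim a disjoint slice of the URL list with no coordination, and makes re-runs reproducible. This is an illustrative sketch, not our actual pipeline code; `assign_shard` is a hypothetical helper.

```python
import hashlib

def assign_shard(url: str, num_shards: int) -> int:
    """Deterministically map a URL to one of `num_shards` download workers.

    Hashing the URL (rather than splitting by list position) keeps shard
    assignment stable even if the URL list is re-collected or reordered.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer
    return int.from_bytes(digest[:8], "big") % num_shards
```

Each of the 56 machines then simply processes the URLs whose shard index matches its own ID.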
After assembling the full 10.7B-image pool, we applied MetaCLIP balancing and CLIP score-based filtering to curate high-quality training data from the pool.
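In spirit, the two curation steps look like the following sketch. These are simplified stand-ins for MetaCLIP-style metadata balancing (cap samples matching frequent metadata entries, keep rare ones whole) and CLIP-score filtering; the function names, sample schema, and thresholds are illustrative, not our production code.

```python
import random
from collections import defaultdict

def balance_pool(samples, max_per_entry, seed=0):
    """MetaCLIP-style balancing sketch.

    `samples` is a list of (sample_id, metadata_entry, clip_score) tuples.
    Head entries (very frequent concepts) are subsampled down to a cap;
    tail entries are kept in full, flattening the concept distribution.
    """
    rng = random.Random(seed)
    by_entry = defaultdict(list)
    for sample in samples:
        by_entry[sample[1]].append(sample)
    kept = []
    for entry, group in by_entry.items():
        if len(group) > max_per_entry:
            group = rng.sample(group, max_per_entry)  # tame head entries
        kept.extend(group)                            # tail entries kept whole
    return kept

def filter_by_clip_score(samples, threshold):
    """Drop pairs whose image-text CLIP similarity falls below a threshold."""
    return [s for s in samples if s[2] >= threshold]
```

Balancing addresses concept skew; score filtering addresses caption-image mismatch. The two are complementary, which is why both appear in the pipeline.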
Grounding Data: Quality Over Quantity
Since we couldn't reproduce SigLIP2's billions of pseudo annotations, we changed our approach. We collected high-quality human-annotated data from open datasets such as OpenImages[16], Objects365[17], and Visual Genome[18], and added model-generated annotations from the GranD[15] dataset, combining a total of 17.2M grounding data samples. While this is an extremely small scale compared to SigLIP2's billions, our strategy was to compete on annotation quality.
| | SigLIP2 | Raon-VisionEncoder |
|---|---|---|
| Annotations | Auto-generated from the entire training set with OWL-ViT | Primarily human-created ground truth |
| Extra compute | 150+ days on 8 H200 GPUs | None |
| Quality | Limited by detector performance | High-quality grounding data |
Beyond the Recipe: Our Additions
We didn't stop at faithfully reproducing SigLIP2's recipe — we introduced three key improvements.
Separate Grounding Batch Construction
In SigLIP2, every training image has pseudo annotations, so the entire batch participates in all tasks. Since we use grounding data from separate sources, we introduced a mixed batch strategy where a portion of the batch is composed of grounding-only samples. Grounding samples only perform referring expression prediction and grounded captioning, and are excluded from contrastive learning.
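The masking involved can be sketched as follows, assuming a boolean per-sample flag over the batch (the flag name and function are hypothetical, for illustration only): grounding rows are dropped before the similarity matrix is formed, so they never contribute to the contrastive term.

```python
import numpy as np

def contrastive_submatrix(img_emb, txt_emb, is_grounding):
    """Mixed-batch sketch: exclude grounding-only samples from contrastive loss.

    Grounding samples contribute only to the decoder objectives (referring
    expression prediction, grounded captioning); here they are masked out
    before the image-text logits are computed.
    """
    keep = ~np.asarray(is_grounding)                       # non-grounding rows
    img = img_emb[keep]
    txt = txt_emb[keep]
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    return img @ txt.T   # (K, K) logits over contrastive samples only
```

Masking before (rather than after) computing similarities also avoids treating grounding captions, which describe single regions, as hard negatives for whole-image captions.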
Gram Anchoring and KoLeo Regularization
SigLIP2's self-supervised learning objectives (SILC[8], TIPS[9]) are derived from the DINO family of methods, but they lack mechanisms to preserve global feature correlation structure or prevent mode collapse in the embedding space. To address this, we added DINOv3[11]'s Gram Anchoring (preserving the feature correlation structure learned in Phase 1) and DINOv2[10]'s KoLeo regularization[13] (maintaining uniform coverage of the embedding space) to Phase 2.
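The two regularizers can be sketched in a few lines of numpy. These are simplified illustrations of the published losses (DINOv3's Gram anchoring, DINOv2's KoLeo), not our training implementations; in practice the anchor features come from a frozen earlier checkpoint.

```python
import numpy as np

def gram_anchor_loss(student_patches, anchor_patches):
    """Gram anchoring sketch (DINOv3): match the student's patch-patch
    similarity (Gram) matrix to that of a frozen anchor model, preserving
    the feature correlation structure learned in an earlier phase."""
    s = student_patches / np.linalg.norm(student_patches, axis=-1, keepdims=True)
    a = anchor_patches / np.linalg.norm(anchor_patches, axis=-1, keepdims=True)
    return float(np.mean((s @ s.T - a @ a.T) ** 2))

def koleo_loss(embeddings, eps=1e-8):
    """KoLeo regularizer sketch (DINOv2): penalize small nearest-neighbor
    distances so embeddings spread uniformly over the unit sphere."""
    x = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # ignore self-distance
    return float(-np.mean(np.log(dists.min(axis=1) + eps)))
```

Gram anchoring constrains pairwise structure (what correlates with what), while KoLeo constrains the global spread of the embedding cloud; together they target exactly the two failure modes noted above.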
Muon Optimizer
We adopted the Muon optimizer[12] instead of SigLIP2's AdamW. Muon has been reported to achieve approximately 2× sample efficiency compared to Adam, making it a perfect fit for our situation of training with 10× less data. In ablation experiments, we confirmed consistent improvements of +2.83 on ImageNet and +2.75 on grounding compared to AdamW.
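At Muon's core is a Newton-Schulz iteration that approximately orthogonalizes the momentum-smoothed gradient of each weight matrix, so the update has roughly uniform singular values. The sketch below follows the coefficients from the Muon reference implementation but is a minimal numpy illustration, not the optimizer code we trained with (which also handles momentum, learning rate, and non-matrix parameters).

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize a gradient matrix (the core of Muon).

    Repeated application of the quintic polynomial below pushes all
    singular values of X toward 1 while preserving singular vectors.
    Coefficients follow the published Muon implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                               # keep X @ X.T as the smaller matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic Newton-Schulz step
    return X.T if transposed else X
```

Equalizing singular values means rarely-excited gradient directions get updates as large as dominant ones, which is one intuition for Muon's reported sample-efficiency gains.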
| | SigLIP2 (Original) | Raon-VisionEncoder (Ours) |
|---|---|---|
| Optimizer | AdamW | Muon (~2× sample efficiency) |
| Phase 2 regularization | — | Gram Anchoring + KoLeo |
| Grounding | OWL-ViT pseudo annotations (10B samples) | Open-source high-quality (17.2M samples) |
| Training data | WebLI (proprietary, 10B samples) | Open-source (1B samples) |
Results: Closing in on SigLIP2
Here are the results of our final model, trained for approximately 2 weeks on 80 B200 GPUs. Despite training with 10× less data using only open data, we achieved results that match or exceed SigLIP2 on several benchmarks.
Dense Prediction: Surpassing SigLIP2
Raon-VisionEncoder performed strongest in dense prediction tasks. It surpassed SigLIP2-NaFlex on Pascal VOC segmentation by +2.12 mIoU and on depth estimation by δ₁ +2.70.
Grounding: 17.2M Samples Against Billions
In grounding, we achieved competitive performance with SigLIP2 (AVG 75.40 vs 75.78), outperforming on 4 of 8 RefCOCO splits. This demonstrates that 17.2M high-quality open-source grounding samples can compete with the OWL-ViT pseudo annotations generated from SigLIP2's entire 10B training images.
Classification and Retrieval: A Gap Remains, but a Meaningful Start
In zero-shot classification, a gap of approximately 5 points remains compared to SigLIP2. This reflects the absolute difference in training data (1B vs 10B samples) and the quality difference of the proprietary WebLI data — results well within the expected range. In retrieval, there is also an approximately 3 point gap on COCO I2T R@1, but on Flickr30K, we achieved near-parity with SigLIP2-NaFlex.
VLM Evaluation: English-Only Training Extends to Korean
We evaluated the vision encoder's VLM downstream performance under the LLaVA-OneVision-1.5 Quick-start protocol. In English benchmarks, it outperformed SigLIP2 on MMMU (45.89 vs 43.44), RealWorldQA (66.14 vs 64.97), CV-Bench (70.90 vs 70.74), and AI2D (75.00 vs 74.90), while nearly matching on ChartQA (71.12 vs 71.48). The notable finding is in Korean benchmarks. Despite being trained exclusively on English data, it achieved parity or advantages over SigLIP2 — which was trained on multilingual data — on KRETA (49.28 vs 48.04), K-VisCUIT (63.47 vs 62.40), and DTCBench (43.75 vs 43.75).
Looking Back, and Looking Ahead
Over the course of the project, we completed the entire process from infrastructure setup to data downloading and curation, ablation experiments, and final model training. This project delivers more than a single checkpoint. By releasing the complete blueprint — including data curation code, training configurations, and model checkpoints — we aim to enable anyone to reproduce a SigLIP2-class vision encoder and build upon it.
There are still challenges ahead. While currently trained on English-only data, the model could benefit from multilingual data even on English tasks, as shown in prior work[20]. Additionally, model-in-the-loop curation, where the model itself guides data selection during training, is another direction that could improve data efficiency.
We look forward to Raon-VisionEncoder becoming excellent eyes for Raon.
Citation
Jungseok Cho, Joo Young Choi†, Han EunGi, Hyunjin Kim†, Jaeah Lee‡, Hakyoung Lee, Seonho Lee, Suekyeong Nam, Soohwan Park†, Sungchan Park, Juhyeong Sun, Moonwon Yu
KRAFTON AI
† Project leads. ‡ Seoul National University. Authors are listed in alphabetical order.
References
- [1] Tschannen et al., "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features," 2025.
- [2] Chen et al., "PaLI: A Jointly-Scaled Multilingual Language-Image Model," 2023.
- [3] Wan et al., "LocCa: Visual Pretraining with Location-Aware Captioners," 2024.
- [4] Minderer et al., "Simple Open-Vocabulary Object Detection with Vision Transformers," ECCV, 2022.
- [5] Xu et al., "Demystifying CLIP Data," ICLR, 2024.
- [6] Schuhmann et al., "LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs," 2021.
- [7] Gadre et al., "DataComp: In Search of the Next Generation of Multimodal Datasets," NeurIPS, 2023.
- [8] Naeem et al., "SILC: Improving Vision Language Pretraining with Self-Distillation," 2023.
- [9] Maninis et al., "TIPS: Text-Image Pretraining with Spatial Awareness," 2024.
- [10] Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," TMLR, 2024.
- [11] Siméoni et al., "DINOv3: Unifying Self-Supervised and Supervised Vision Transformer Features," 2025.
- [12] Jordan et al., "Muon: An Optimizer for Hidden Layers in Neural Networks," 2024.
- [13] Sablayrolles et al., "Spreading Vectors for Similarity Search," ICLR, 2019.
- [14] Dehghani et al., "Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution," NeurIPS, 2023.
- [15] Rasheed et al., "GLaMM: Pixel Grounding Large Multimodal Model," CVPR, 2024.
- [16] Kuznetsova et al., "The Open Images Dataset V4," IJCV, 2020.
- [17] Shao et al., "Objects365: A Large-Scale, High-Quality Dataset for Object Detection," ICCV, 2019.
- [18] Krishna et al., "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations," IJCV, 2017.
- [20] Chuang et al., "MetaCLIP 2: A Worldwide Scaling Recipe," 2025.