Mission: Give Raon Eyes
This project began to give KRAFTON AI's Raon "eyes" that could understand the world. For a model to converse with the world beyond text, a powerful vision encoder that precisely understands images is essential. Our goal was to reproduce Google's SigLIP2[1], the current state-of-the-art vision encoder.
💡 Raon-VisionEncoder project goal: Train a SigLIP2-class vision encoder from scratch using only open data, and make the entire process public.
The Target Is SigLIP2, but Key Ingredients Are Missing
SigLIP2 is a model that demonstrates top-tier performance across a wide range of vision benchmarks including zero-shot classification, retrieval, segmentation, and grounding. Its training pipeline is well-organized into three phases: Phase 1 builds fundamentals with contrastive learning and LocCa[3] loss, Phase 2 learns dense features through self-supervised learning (SILC[8], TIPS[9]), and Phase 3 adapts to various resolutions with NaFlex[14].
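For concreteness, the core of Phase 1 is the pairwise sigmoid loss introduced by SigLIP and retained in SigLIP2: every image-text pair in the batch is scored independently, with no softmax normalization across the batch. The sketch below is a minimal numpy illustration; the `temperature` and `bias` defaults are illustrative constants, not our training hyperparameters.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss from SigLIP (the Phase 1 contrastive objective).

    Matching pairs (the diagonal) get label +1; all other pairs get -1.
    Each of the B^2 pairs contributes an independent binary-classification
    term, so the loss does not depend on batch-wide normalization.
    """
    # L2-normalize both embedding sets
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = temperature * img @ txt.T + bias     # (B, B) similarity logits
    labels = 2.0 * np.eye(len(img)) - 1.0         # +1 on diagonal, -1 elsewhere
    # -log sigmoid(label * logit) = log(1 + exp(-label * logit)), averaged
    return float(np.mean(np.log1p(np.exp(-labels * logits))))
```

Because each pair is scored independently, the loss behaves well across batch sizes, which is one reason SigLIP-style training scales cleanly.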
The problem was that while we had the recipe, the key ingredients were missing.
Missing Ingredient 1: 10B-Scale Training Data
SigLIP2 is trained on Google's proprietary dataset, WebLI[2] — 10 billion image-text pairs that we had no access to whatsoever. We had to match this scale using only open data.
Missing Ingredient 2: Grounding Data
SigLIP2's LocCa decoder learns to indicate and describe specific regions in images using bounding boxes. To achieve this, SigLIP2 runs OWL-ViT[4], a large-scale object detector, across the entire training dataset to generate billions of pseudo annotations. We estimated that this step alone would take over 150 days on 8 H200 GPUs, an enormous compute cost.
Missing Ingredient 3: Training Code
While SigLIP2's paper describes the training recipe in detail, the training code itself is not publicly available. From the 3-phase training pipeline, to the combination of various loss functions, to NaFlex resolution adaptation — everything had to be implemented from scratch based on the paper's descriptions.
Preparing the Ingredients: Assembling a 10.7B Data Pool
The WebLI data used to train SigLIP2 has an undisclosed composition, making direct use difficult. As an alternative, we decided to follow the approach of MetaCLIP[5], which has a publicly available large-scale data acquisition and curation pipeline.
The data pool was built on LAION-400M (280M)[6] and DataComp-1B (1.4B)[7], and to fill the remaining gap, we directly processed Common Crawl web archives. We extracted image-text pairs from a total of 42 snapshots spanning 2017 to 2026, yielding approximately 9 billion images.
Downloading and processing 9 billion images is a task that, approached naively, could consume tens of days by itself. To avoid spending the entire project timeline on data download, we worked through numerous engineering challenges one by one, built an optimized pipeline, and ran 56 machines for approximately 10 days to acquire the roughly 9 billion additional images.
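One small example of the kind of engineering choice involved: splitting billions of URLs across machines. A deterministic hash-based sharder lets each machine claim a disjoint slice of the URL list with no coordination, and makes re-runs reproducible. This is an illustrative sketch, not our actual pipeline code; `assign_shard` is a hypothetical helper.

```python
import hashlib

def assign_shard(url: str, num_shards: int) -> int:
    """Deterministically map a URL to one of `num_shards` download workers.

    Hashing the URL (rather than splitting by list position) keeps shard
    assignment stable even if the URL list is re-collected or reordered.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer
    return int.from_bytes(digest[:8], "big") % num_shards
```

Each of the 56 machines then simply processes the URLs whose shard index matches its own ID.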
After assembling the full 10.7B-image pool, we applied MetaCLIP balancing and CLIP score-based filtering to curate high-quality training data from the pool.
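In spirit, the two curation steps look like the following sketch. These are simplified stand-ins for MetaCLIP-style metadata balancing (cap samples matching frequent metadata entries, keep rare ones whole) and CLIP-score filtering; the function names, sample schema, and thresholds are illustrative, not our production code.

```python
import random
from collections import defaultdict

def balance_pool(samples, max_per_entry, seed=0):
    """MetaCLIP-style balancing sketch.

    `samples` is a list of (sample_id, metadata_entry, clip_score) tuples.
    Head entries (very frequent concepts) are subsampled down to a cap;
    tail entries are kept in full, flattening the concept distribution.
    """
    rng = random.Random(seed)
    by_entry = defaultdict(list)
    for sample in samples:
        by_entry[sample[1]].append(sample)
    kept = []
    for entry, group in by_entry.items():
        if len(group) > max_per_entry:
            group = rng.sample(group, max_per_entry)  # tame head entries
        kept.extend(group)                            # tail entries kept whole
    return kept

def filter_by_clip_score(samples, threshold):
    """Drop pairs whose image-text CLIP similarity falls below a threshold."""
    return [s for s in samples if s[2] >= threshold]
```

Balancing addresses concept skew; score filtering addresses caption-image mismatch. The two are complementary, which is why both appear in the pipeline.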
Grounding Data: Quality Over Quantity
Since we couldn't reproduce SigLIP2's billions of pseudo annotations, we changed our approach. We collected high-quality human-annotated data from open datasets such as OpenImages[16], Objects365[17], and Visual Genome[18], and added model-generated annotations from the GranD[15] dataset, combining a total of 17.2M grounding data samples. While this is an extremely small scale compared to SigLIP2's billions, our strategy was to compete on annotation quality.
| | SigLIP2 | Raon-VisionEncoder |
|---|---|---|
| Annotations | Auto-generated from the entire training set with OWL-ViT | Primarily human-created ground truth |
| Extra compute | 150+ days on 8 H200 GPUs | None |
| Quality | Limited by detector performance | High-quality grounding data |
Beyond the Recipe: Our Additions
We didn't stop at faithfully reproducing SigLIP2's recipe — we introduced three key improvements.
Separate Grounding Batch Construction
In SigLIP2, every training image has pseudo annotations, so the entire batch participates in all tasks. Since we use grounding data from separate sources, we introduced a mixed batch strategy where a portion of the batch is composed of grounding-only samples. Grounding samples only perform referring expression prediction and grounded captioning, and are excluded from contrastive learning.
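The masking involved can be sketched as follows, assuming a boolean per-sample flag over the batch (the flag name and function are hypothetical, for illustration only): grounding rows are dropped before the similarity matrix is formed, so they never contribute to the contrastive term.

```python
import numpy as np

def contrastive_submatrix(img_emb, txt_emb, is_grounding):
    """Mixed-batch sketch: exclude grounding-only samples from contrastive loss.

    Grounding samples contribute only to the decoder objectives (referring
    expression prediction, grounded captioning); here they are masked out
    before the image-text logits are computed.
    """
    keep = ~np.asarray(is_grounding)                       # non-grounding rows
    img = img_emb[keep]
    txt = txt_emb[keep]
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    return img @ txt.T   # (K, K) logits over contrastive samples only
```

Masking before (rather than after) computing similarities also avoids treating grounding captions, which describe single regions, as hard negatives for whole-image captions.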
Gram Anchoring and KoLeo Regularization
SigLIP2's self-supervised learning objectives (SILC[8], TIPS[9]) are derived from the DINO family of methods, but they lack mechanisms to preserve global feature correlation structure or prevent mode collapse in the embedding space. To address this, we added DINOv3[11]'s Gram Anchoring (preserving the feature correlation structure learned in Phase 1) and DINOv2[10]'s KoLeo regularization[13] (maintaining uniform coverage of the embedding space) to Phase 2.
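The two regularizers can be sketched in a few lines of numpy. These are simplified illustrations of the published losses (DINOv3's Gram anchoring, DINOv2's KoLeo), not our training implementations; in practice the anchor features come from a frozen earlier checkpoint.

```python
import numpy as np

def gram_anchor_loss(student_patches, anchor_patches):
    """Gram anchoring sketch (DINOv3): match the student's patch-patch
    similarity (Gram) matrix to that of a frozen anchor model, preserving
    the feature correlation structure learned in an earlier phase."""
    s = student_patches / np.linalg.norm(student_patches, axis=-1, keepdims=True)
    a = anchor_patches / np.linalg.norm(anchor_patches, axis=-1, keepdims=True)
    return float(np.mean((s @ s.T - a @ a.T) ** 2))

def koleo_loss(embeddings, eps=1e-8):
    """KoLeo regularizer sketch (DINOv2): penalize small nearest-neighbor
    distances so embeddings spread uniformly over the unit sphere."""
    x = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # ignore self-distance
    return float(-np.mean(np.log(dists.min(axis=1) + eps)))
```

Gram anchoring constrains pairwise structure (what correlates with what), while KoLeo constrains the global spread of the embedding cloud; together they target exactly the two failure modes noted above.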
Muon Optimizer
We adopted the Muon optimizer[12] instead of SigLIP2's AdamW. Muon has been reported to achieve approximately 2× sample efficiency compared to Adam, making it a perfect fit for our situation of training with 10× less data. In ablation experiments, we confirmed consistent improvements of +2.83 on ImageNet and +2.75 on grounding compared to AdamW.
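At Muon's core is a Newton-Schulz iteration that approximately orthogonalizes the momentum-smoothed gradient of each weight matrix, so the update has roughly uniform singular values. The sketch below follows the coefficients from the Muon reference implementation but is a minimal numpy illustration, not the optimizer code we trained with (which also handles momentum, learning rate, and non-matrix parameters).

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize a gradient matrix (the core of Muon).

    Repeated application of the quintic polynomial below pushes all
    singular values of X toward 1 while preserving singular vectors.
    Coefficients follow the published Muon implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                               # keep X @ X.T as the smaller matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic Newton-Schulz step
    return X.T if transposed else X
```

Equalizing singular values means rarely-excited gradient directions get updates as large as dominant ones, which is one intuition for Muon's reported sample-efficiency gains.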
| | SigLIP2 (Original) | Raon-VisionEncoder (Ours) |
|---|---|---|
| Optimizer | AdamW | Muon (~2× sample efficiency) |
| Phase 2 regularization | — | Gram Anchoring + KoLeo |
| Grounding | OWL-ViT pseudo annotations (10B samples) | Open-source high-quality (17.2M samples) |
| Training data | WebLI (proprietary, 10B samples) | Open-source (1B samples) |
Results: Closing in on SigLIP2
Here are the results of our final model, trained for approximately 2 weeks on 80 B200 GPUs. Despite training with 10× less data using only open data, we achieved results that match or exceed SigLIP2 on several benchmarks.
Dense Prediction: Surpassing SigLIP2
Raon-VisionEncoder performed strongest in dense prediction tasks. It surpassed SigLIP2-NaFlex on Pascal VOC segmentation by +2.12 mIoU and on depth estimation by δ₁ +2.70.
Grounding: 17.2M Samples Against Billions
In grounding, we achieved competitive performance with SigLIP2 (AVG 75.40 vs 75.78), outperforming on 4 of 8 RefCOCO splits. This demonstrates that 17.2M high-quality open-source grounding samples can compete with the OWL-ViT pseudo annotations generated from SigLIP2's entire 10B training images.
Classification and Retrieval: A Gap Remains, but a Meaningful Start
In zero-shot classification, a gap of approximately 5 points remains compared to SigLIP2. This reflects the absolute difference in training data (1B vs 10B samples) and the quality difference of the proprietary WebLI data — results well within the expected range. In retrieval, there is also an approximately 3 point gap on COCO I2T R@1, but on Flickr30K, we achieved near-parity with SigLIP2-NaFlex.
VLM Evaluation: English-Only Training Extends to Korean
We evaluated the vision encoder's VLM downstream performance under the LLaVA-OneVision-1.5 Quick-start protocol. In English benchmarks, it outperformed SigLIP2 on MMMU (45.89 vs 43.44), RealWorldQA (66.14 vs 64.97), CV-Bench (70.90 vs 70.74), and AI2D (75.00 vs 74.90), while nearly matching on ChartQA (71.12 vs 71.48). The notable finding is in Korean benchmarks. Despite being trained exclusively on English data, it achieved parity or advantages over SigLIP2 — which was trained on multilingual data — on KRETA (49.28 vs 48.04), K-VisCUIT (63.47 vs 62.40), and DTCBench (43.75 vs 43.75).
Looking Back, and Looking Ahead
Over the course of the project, we completed the entire process from infrastructure setup to data downloading and curation, ablation experiments, and final model training. This project delivers more than a single checkpoint. By releasing the complete blueprint — including data curation code, training configurations, and model checkpoints — we aim to enable anyone to reproduce a SigLIP2-class vision encoder and build upon it.
There are still challenges ahead. While currently trained on English-only data, the model could benefit from multilingual data even on English tasks, as shown in prior work[20]. Additionally, model-in-the-loop curation, where the model itself guides data selection during training, is another direction that could improve data efficiency.
We look forward to Raon-VisionEncoder becoming excellent eyes for Raon.
Citation
Jungseok Cho, Joo Young Choi†, Han EunGi, Hyunjin Kim†, Jaeah Lee‡, Hakyoung Lee, Seonho Lee, Suekyeong Nam, Soohwan Park†, Sungchan Park, Juhyeong Sun, Moonwon Yu
KRAFTON AI
† Project leads. ‡ Seoul National University. Authors are listed in alphabetical order.
References
- [1] Tschannen et al., "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features," 2025.
- [2] Chen et al., "PaLI: A Jointly-Scaled Multilingual Language-Image Model," 2023.
- [3] Wan et al., "LocCa: Visual Pretraining with Location-Aware Captioners," 2024.
- [4] Minderer et al., "Simple Open-Vocabulary Object Detection with Vision Transformers," ECCV, 2022.
- [5] Xu et al., "Demystifying CLIP Data," ICLR, 2024.
- [6] Schuhmann et al., "LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs," 2021.
- [7] Gadre et al., "DataComp: In Search of the Next Generation of Multimodal Datasets," NeurIPS, 2023.
- [8] Naeem et al., "SILC: Improving Vision Language Pretraining with Self-Distillation," 2023.
- [9] Maninis et al., "TIPS: Text-Image Pretraining with Spatial Awareness," 2024.
- [10] Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," TMLR, 2024.
- [11] Siméoni et al., "DINOv3: Unifying Self-Supervised and Supervised Vision Transformer Features," 2025.
- [12] Jordan et al., "Muon: An Optimizer for Hidden Layers in Neural Networks," 2024.
- [13] Sablayrolles et al., "Spreading Vectors for Similarity Search," ICLR, 2019.
- [14] Dehghani et al., "Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution," NeurIPS, 2023.
- [15] Rasheed et al., "GLaMM: Pixel Grounding Large Multimodal Model," CVPR, 2024.
- [16] Kuznetsova et al., "The Open Images Dataset V4," IJCV, 2020.
- [17] Shao et al., "Objects365: A Large-Scale, High-Quality Dataset for Object Detection," ICCV, 2019.
- [18] Krishna et al., "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations," IJCV, 2017.
- [20] Chuang et al., "MetaCLIP 2: A Worldwide Scaling Recipe," 2025.