PaLI: A Jointly-Scaled Multilingual Language-Image Model X Chen, X Wang, S Changpinyo, AJ Piergiovanni, P Padlewski, D Salz, ... ICLR 2023 (Oral), 2022 | 693 | 2022 |
Scaling vision transformers to 22 billion parameters M Dehghani, J Djolonga, B Mustafa, P Padlewski, J Heek, J Gilmer, ... ICML 2023 (Oral), 2023 | 604 | 2023 |
LiT: Zero-Shot Transfer with Locked-image Text Tuning X Zhai, X Wang, B Mustafa, A Steiner, D Keysers, A Kolesnikov, L Beyer CVPR 2022, 2021 | 589 | 2021 |
Simple Open-Vocabulary Object Detection with Vision Transformers M Minderer, A Gritsenko, A Stone, M Neumann, D Weissenborn, ... ECCV 2022, 2022 | 540* | 2022 |
Measuring compositional generalization: A comprehensive method on realistic data D Keysers, N Schärli, N Scales, H Buisman, D Furrer, S Kashubin, ... ICLR 2020, 2019 | 392 | 2019 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model X Chen, J Djolonga, P Padlewski, B Mustafa, S Changpinyo, J Wu, ... CVPR 2024, 2023 | 177 | 2023 |
PaliGemma: A Versatile 3B VLM for Transfer L Beyer, A Steiner, AS Pinto, A Kolesnikov, X Wang, D Salz, M Neumann, ... arXiv preprint arXiv:2407.07726, 2024 | 147 | 2024 |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger X Chen, X Wang, L Beyer, A Kolesnikov, J Wu, P Voigtlaender, B Mustafa, ... arXiv preprint arXiv:2310.09199, 2023 | 86 | 2023 |
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models A Pouget, L Beyer, E Bugliarello, X Wang, AP Steiner, X Zhai, ... NeurIPS 2024, 2024 | 14 | 2024 |
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? I Alabdulmohsin, X Wang, A Steiner, P Goyal, A D'Amour, X Zhai ICLR 2024, 2024 | 12 | 2024 |
Three Towers: Flexible Contrastive Learning with Pretrained Image Models J Kossen, M Collier, B Mustafa, X Wang, X Zhai, L Beyer, A Steiner, ... NeurIPS 2023, 2023 | 10 | 2023 |
A study of autoregressive decoders for multi-tasking in computer vision L Beyer, B Wan, G Madan, F Pavetic, A Steiner, A Kolesnikov, AS Pinto, ... arXiv preprint arXiv:2303.17376, 2023 | 8 | 2023 |
PaliGemma 2: A Family of Versatile VLMs for Transfer A Steiner, AS Pinto, M Tschannen, D Keysers, X Wang, Y Bitton, ... arXiv preprint arXiv:2412.03555, 2024 | 7 | 2024 |
LocCa: Visual Pretraining with Location-aware Captioners B Wan, M Tschannen, Y Xian, F Pavetic, I Alabdulmohsin, X Wang, ... NeurIPS 2024, 2024 | 5 | 2024 |
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features M Tschannen, A Gritsenko, X Wang, MF Naeem, ... | | 2025 |
Scaling Pre-training to One Hundred Billion Data for Vision Language Models X Wang, I Alabdulmohsin, D Salz, Z Li, K Rong, X Zhai arXiv preprint arXiv:2502.07617, 2025 | | 2025 |
Locked-Model Multimodal Contrastive Tuning D Keysers, X Zhai, X Wang, L Beyer, B Mustafa, A Steiner, A Kolesnikov US Patent App. 18/051,106, 2024 | | 2024 |