InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks Z Chen, J Wu, W Wang, W Su, G Chen, S Xing, M Zhong, Q Zhang, X Zhu, ... Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024 | 383* | 2024 |
InternVideo: General video foundation models via generative and discriminative learning Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao, H Zhang, J Xu, Y Liu, Z Wang, ... arXiv preprint arXiv:2212.03191, 2022 | 281 | 2022 |
InternVid: A large-scale video-text dataset for multimodal understanding and generation Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li, G Chen, X Chen, Y Wang, C He, ... ICLR2023, 2023 | 155 | 2023 |
MVBench: A comprehensive multi-modal video understanding benchmark K Li, Y Wang, Y He, Y Li, Y Wang, Y Liu, Z Wang, J Xu, G Chen, P Luo, ... Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024 | 150 | 2024 |
VideoLLM: Modeling video sequence with large language models G Chen, YD Zheng, J Wang, J Xu, Y Huang, J Pan, Y Wang, Y Wang, ... arXiv preprint arXiv:2305.13292, 2023 | 71 | 2023 |
DCAN: Improving temporal action detection via dual context aggregation G Chen, YD Zheng, L Wang, T Lu Proceedings of the AAAI Conference on Artificial Intelligence 36 (1), 248-257, 2022 | 64 | 2022 |
InternVideo2: Scaling video foundation models for multimodal video understanding Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei, R Zheng, J Xu, Z Wang, ... arXiv preprint arXiv:2403.15377, 2024 | 59 | 2024 |
Video Mamba Suite: State space model as a versatile alternative for video understanding G Chen, Y Huang, J Xu, B Pei, Z Chen, Z Li, J Wang, K Li, T Lu, L Wang arXiv preprint arXiv:2403.09626, 2024 | 47 | 2024 |
BasicTAD: An astounding RGB-only baseline for temporal action detection M Yang, G Chen, YD Zheng, T Lu, L Wang Computer Vision and Image Understanding 232, 103692, 2023 | 40 | 2023 |
InternVideo-Ego4D: A pack of champion solutions to Ego4D challenges G Chen, S Xing, Z Chen, Y Wang, K Li, Y Li, Y Liu, J Wang, YD Zheng, ... arXiv preprint arXiv:2211.09529, 2022 | 40 | 2022 |
AVSegFormer: Audio-visual segmentation with transformer S Gao, Z Chen, G Chen, W Wang, T Lu Proceedings of the AAAI Conference on Artificial Intelligence 38 (11), 12155 …, 2024 | 31 | 2024 |
Memory-and-anticipation transformer for online action understanding J Wang*, G Chen*, Y Huang, L Wang, T Lu Proceedings of the IEEE/CVF International Conference on Computer Vision …, 2023 | 27 | 2023 |
FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation Z Chen, J Wang, W Wang, G Chen, E Xie, P Luo, T Lu arXiv preprint arXiv:2111.02394, 2021 | 20 | 2021 |
Retrieval-augmented egocentric video captioning J Xu, Y Huang, J Hou, G Chen, Y Zhang, R Feng, W Xie Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024 | 14 | 2024 |
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World Y Huang, G Chen, J Xu, M Zhang, L Yang, B Pei, H Zhang, L Dong, ... Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024 | 12 | 2024 |
MRSN: Multi-relation support network for video action detection YD Zheng, G Chen, M Yuan, T Lu 2023 IEEE International Conference on Multimedia and Expo (ICME), 1026-1031, 2023 | 10 | 2023 |
EgoVideo: Exploring egocentric foundation model and downstream adaptation B Pei, G Chen, J Xu, Y He, Y Liu, K Pan, Y Huang, Y Wang, T Lu, L Wang, ... arXiv preprint arXiv:2406.18070, 2024 | 3 | 2024 |
Matching Compound Prototypes for Few-Shot Action Recognition Y Huang, L Yang, G Chen, H Zhang, F Lu, Y Sato International Journal of Computer Vision, 1-26, 2024 | 2 | 2024 |
Champion solution for the WSDM2023 Toloka VQA challenge S Gao, Z Chen, G Chen, W Wang, T Lu arXiv preprint arXiv:2301.09045, 2023 | 2 | 2023 |