About Me

I am a Young Leading Scientist at the Shanghai AI Lab and an Adjunct Doctoral Supervisor at the School of AI, SJTU. Recognized as a National-level Young Talent, I hold a Ph.D. from Tsinghua University and was a visiting researcher at Stanford University and Imperial College London.

My research in data-centric AI and high-performance computing has driven significant technological innovation and industry impact. I have authored over 150 papers in top-tier venues and garnered over 9,000 citations on Google Scholar, and my open-source projects have attracted over 50,000 stars on GitHub. My accolades include the Gordon Bell Prize, an ACL Best Theme Paper Award, and the WAIC Yunfan Award. I am the creator of MinerU, the world’s leading open-source data engine for large models. This work, in conjunction with OpenDataLab, has significantly influenced the AI and open-source landscape. Additionally, I oversee a dedicated data team that curates high-quality datasets for leading models such as InternLM and InternVL.

We are hiring! I am actively seeking talented Ph.D. students, postdoctoral fellows, interns, and full-time researchers. If you are passionate about building the future of AI, I welcome you to contact me via email.

🔥 Recent News

  • 2025.07:  🎉 I received the ACL Best Theme Paper Award [1].
  • 2025.07:  🎉 I won the World Artificial Intelligence Conference Yunfan Award (one of 11 global recipients under the age of 35).
  • 2025.05:  🎉 Five papers [1][2][3][4][5] were accepted to ICCV 2025.
  • 2025.05:  🎉 Eleven papers [1][2][3][4][5][6][7][8][9][10][11] were accepted to ACL 2025.
  • 2025.02:  🎉 Five papers [1][2][3][4][5] were accepted to CVPR 2025.
  • 2025.01:  🎉 One paper [1] was accepted to NAACL 2025.
  • 2025.01:  🎉 Seven papers [1][2][3][4][5][6][7] were accepted to ICLR 2025.

💻 Open-source Projects

  • MinerU, the world’s leading open-source data parsing engine for LLM/RAG/Agent applications.
  • InternLM, a series of leading LLMs developed by Shanghai AI Laboratory.
  • OpenDataLab, an open platform that facilitates the development of AGI by sharing datasets and open-source tools. It hosts over 7,700 datasets and provides 50+ million data retrieval services to over 200,000 developers.

📝 Selected Publications

I have authored over 150 papers in top-tier venues and garnered over 9,000 citations on Google Scholar. The following are selected publications. († Corresponding Authors)

  1. ACL 2025 Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models, Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He† (ACL Best Theme Paper 🎉)
  2. ICLR 2025 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text, Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, …, Conghui He†, Jifeng Dai†
  3. ECCV 2024 MMBench: Is Your Multi-modal Model an All-around Player?, Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin (over 1,000 citations 🎉)
  4. ECCV 2024 ShareGPT4V: Improving Large Multi-Modal Models with Better Captions, Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin
  5. SC 2017 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, Haohuan Fu†, Conghui He†, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue†, Weiguo Liu, Wanwang Yin, Guangwen Yang, Xiaofei Chen (Gordon Bell Prize 🎉)

🎖 Selected Honors

  • 2025, ACL Best Theme Paper (3/8000)
  • 2025, World Artificial Intelligence Conference Yunfan Award (one of 11 global recipients under the age of 35)
  • 2023, SenseTime Award (SenseTime’s highest award, 1 of 100 teams)
  • 2019, Tencent Technology Breakthrough Award - Gold Prize (highest technical award, 1 of 50 teams)
  • 2018, Outstanding Graduate PhD Student Award
  • 2017, ACM Gordon Bell Prize (the highest award in the field of HPC applications)
  • 2017, National PhD Scholarship (1%)
  • 2013, Global Champion of the IEEE-IBM Smarter Planet Challenge (Team Leader, 1/54)