I am currently a Research Director at SenseTime Inc., as well as a Research Scientist and PI at the Shanghai AI Laboratory. Prior to this, I worked at WeChat as a Senior Researcher, where I initiated and developed the high-performance graph computing framework, Plato. Before joining WeChat, I earned my PhD degree (2013-2018) from the Department of Computer Science at Tsinghua University under the supervision of Prof. Haohuan Fu, and my Bachelor’s degree (2009-2013) from the Department of Software Engineering at Sun Yat-Sen University.

My research interests include High Performance Computing, Computer Vision, and Large Language Models. In 2017, I was honored with the Gordon Bell Prize, which is the highest distinction in the high-performance computing application domain. Currently, I lead the OpenDataLab team, which aims to build an influential open dataset platform that facilitates the development, analysis and research of Artificial General Intelligence (AGI). Additionally, I oversee a data team that collects and curates massive datasets for large language models.

At SenseTime and the Shanghai AI Laboratory, we are actively hiring PhDs, postdocs, interns, and full-time researchers. If you’re interested in joining our team, please feel free to reach out to me via email.

🔥 News

  • 2024.09:  🎉 [1] paper is accepted by NeurIPS 2024.
  • 2024.09:  🎉 [1][2][3] papers are accepted by EMNLP 2024.
  • 2024.07:  🎉 [1†][2†][3][4] papers are accepted by ECCV 2024.
  • 2024.05:  🎉 [1] paper is accepted by ICML 2024.
  • 2024.05:  🎉 [1†][2] papers are accepted by ACL 2024.
  • 2024.03:   We release Wanjuan-CC, a safe and high-quality Webtext dataset.
  • 2024.02:  🎉 [1†][2][3] papers are accepted by CVPR 2024.
  • 2023.09:   We release InternLM2. See arXiv for details.
  • 2023.09:  🎉 [1] paper is accepted by AAAI 2024.
  • 2023.08:   We release Wanjuan 1.0, a large-scale multi-modal dataset for pretraining.
  • 2023.06:   We release InternLM. You can find the technical report here.
  • 2022.03:   We launch OpenDataLab, an open data platform that empowers AGI.

💻 Projects

  • OpenDataLab, an open platform that facilitates the development of AGI by sharing datasets and open-sourced tools. It hosts over 7700 datasets and provides 50+ million data retrieval services to over 40,000 developers.
  • MinerU, a one-stop, open-sourced and high-quality data extraction tool that supports PDF, webpage and e-book extraction. It is widely used in RAG as well as in training LLMs.
  • InternLM, a series of 7B and 20B base and chat models, featuring outstanding reasoning capability, 1M context window and the ability to use tools.
  • PDF-Extract-Kit, a comprehensive toolkit for high-quality PDF content extraction.

📝 Publications (Google Scholar)

For a full list, please refer to my Google Scholar. (* Interns & Students, † Corresponding Authors)

📚 Large Language Model (LLM)

  1. ECCV 2024 MMBench: Is your multi-modal model an all-around player?, Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin
  2. ECCV 2024 ShareGPT4V: Improving large multi-modal models with better captions, Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin
  3. ACL 2024 Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations, Jiaxing Sun*, Weiquan Huang, Jiang Wu, Chenya Gu, Wei Li, Songyang Zhang, Hang Yan, and Conghui He†.
  4. AAAI 2024 VIGC: Visual instruction generation and correction, Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He†
  5. ECCV 2024 Parrot Captions Teach CLIP to Spot Text, Yiqi Lin*, Conghui He†, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou
  6. CVPR 2023 OmniCity: Omnipotent city understanding with multi-level and multi-view images, Weijia Li, Yawen Lai, Linning Xu, Yuanbo Xiangli, Jinhua Yu, Conghui He†, Gui-Song Xia†, and Dahua Lin.

💻 High Performance Computing (HPC)

  1. SC 2017 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-meter scenarios, Haohuan Fu†, Conghui He†, Bingwei Chen, Zekun Yin, Zhenguo Zhang et al. (ACM Gordon Bell Prize Award)
  2. TC 2017 A fully-pipelined hardware design for gaussian mixture models, Conghui He, Haohuan Fu, Ce Guo, Wayne Luk, Guangwen Yang
  3. BigData 2019 Finding Mutual X at WeChat-Scale Social Network in Ten Minutes, Conghui He, Shijie Sun, Benli Li, Xiaogang Tu, Donghai Yu
  4. FCCM 2017 A Nanosecond-level Hybrid Table Design for Financial Market Data Generators, Haohuan Fu, Conghui He, Wayne Luk, Weijia Li, and Guangwen Yang

🎖 Honors and Awards

  • 2023, SenseTime Award (SenseTime’s highest award, 1 team from 100 teams)
  • 2021, Outstanding Team Award at SenseTime (10 teams from 200 teams)
  • 2019, Tencent Technology Breakthrough Award - Gold Prize (highest technical award, 1 team from 50 teams)
  • 2018, Outstanding Graduate PhD Student Award
  • 2017, ACM Gordon Bell Prize (the highest award in the field of HPC applications)
  • 2017, National PhD Scholarship (1%)
  • 2013, Global Champion of the IEEE-IBM Smarter Planet Challenge (Team Leader, 1/54)
  • 2010, National Scholarship (1%)