Career Profile

I'm experienced in designing parallel algorithms for heterogeneous architectures such as the Sunway supercomputer, GPUs, multi-core CPUs, and FPGAs to solve computational challenges arising from geoscience applications. I am also interested in novel designs for financial applications on reconfigurable platforms. You can check out my CV in English or in Chinese.


Lead Software Developer

2015 - Present
National Supercomputing Center in Wuxi

Responsible for the underlying architecture of the Sunway TaihuLight supercomputer and the software and applications that are deployed and run on it. Also responsible for overseeing the work of the other software engineers on my team.

Unleashing the performance of different parts of the Sunway supercomputer (computation, communication, I/O, bandwidth, etc.). Optimizing application performance with the AAA principles (Architecture, Application, Algorithm).

FPGA Application Developer (Intern)

2016.11 - 2017.6
Maxeler Technologies, London

Maxeler Technologies is a leading provider of dataflow computing platforms, solutions, and appliances. I was responsible for developing and testing the MaxMPT project, a collaboration with the Chicago Mercantile Exchange (CME). I also contributed code to NetworkingCodeExamples and maxpower.

HPC Engineer (Intern)

2014 - 2015
Statoil, Beijing

Statoil is an international energy company and the world's largest offshore operator. I was responsible for designing an efficient parallel scheme for the beam migration and reverse time migration algorithms and then fully optimizing them on a GPU cluster.


18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight - This work presents our large-scale nonlinear earthquake simulation software on Sunway TaihuLight, which achieves over 15% of the system's peak performance, better than the 11.8% efficiency achieved by similar software running on Titan, whose byte-to-flop ratio is 5 times better than TaihuLight's. The extreme cases demonstrate a sustained performance of over 18.9 Pflops, enabling the simulation of the Tangshan earthquake as an 18-Hz scenario with an 8-meter resolution. The related work was accepted at SC17 and won the Gordon Bell Prize.
An FPGA-based Extremely Low-latency Market Server - In cooperation with the China Financial Futures Exchange (CFFEX), we design and implement a novel and efficient CPU-FPGA hybrid data structure for the order book. Our FPGA-based market server sustains 1-10 Gb/s bandwidth with a latency of 3 ms, providing a 30-fold latency reduction compared to a fully optimized CPU-based solution.
Ensemble Full Waveform Inversion with Source Encoding - This approach (EnFWI) refines the velocity model iteratively by incorporating the observation, while the nonlinear evolution of the covariance is approximated by the ensemble covariance. Encoded simultaneous-source FWI (ESSFWI) is applied to improve the representation of the low-rank ensemble approximation and to increase the rate of convergence. Experiments show that EnFWI achieves a larger convergence range and better tolerance to data noise at lower computational cost than traditional FWI methods.
Parallel GPU Beam Migration - A fast parallel beam migration implementation that runs efficiently on GPU clusters. The significant performance improvement further closes the gap to an interactive migration engine.

Skills & Proficiency

C/C++ & Bash

MPI & OpenMP


HPC & Optimization


Deep Learning