Posts by Collection

publications

Liberator: A Data Reuse Framework for Out-of-Memory Graph Computing on GPUs

Published in IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023

Liberator is a data reuse framework that enables efficient out-of-memory graph computing on GPUs by intelligently managing data transfers between host and device memory.

S. Li, R. Tang, J. Zhu, Z. Zhao, X. Gong, W. Wang, J. Zhang, P.-C. Yew. "Liberator: A Data Reuse Framework for Out-of-Memory Graph Computing on GPUs." IEEE Transactions on Parallel and Distributed Systems (TPDS), 34(6): 1954-1967, 2023.
Download Paper | Code

OneGraph: A Cross-Architecture Framework for Large-Scale Graph Computing on GPUs Based on oneAPI

Published in CCF Transactions on High Performance Computing (CCF-THPC), 2024

OneGraph is a cross-architecture graph computing framework built on oneAPI that enables portable and efficient large-scale graph processing across different GPU architectures.

S. Li, J. Zhu, J. Han, Y. Peng, Z. Wang, X. Gong, G. Wang, J. Zhang, X. Wang. "OneGraph: A Cross-Architecture Framework for Large-Scale Graph Computing on GPUs Based on oneAPI." CCF Transactions on High Performance Computing (CCF-THPC), 6(2): 179-191, 2024.
Download Paper

DR-CircuitGNN: Training Acceleration of Heterogeneous Circuit Graph Neural Network on GPUs

Published in ACM International Conference on Supercomputing (ICS), 2025

DR-CircuitGNN accelerates the training of heterogeneous circuit graph neural networks on GPUs through novel data reuse strategies and GPU-optimized computation kernels.

Y. Luo, S. Li, J. Tao, K. G. Thorat, X. Xie, H. Peng, N. Xu, C. Ding, S. Huang. "DR-CircuitGNN: Training Acceleration of Heterogeneous Circuit Graph Neural Network on GPUs." In Proceedings of the 39th ACM International Conference on Supercomputing (ICS '25), 2025.
Download Paper

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Published as an arXiv preprint, 2025

CudaForge is an agentic framework that automatically optimizes CUDA kernels using LLM-based agents with hardware profiling feedback, achieving significant speedups over hand-tuned baselines.

Z. Zhang, R. Wang, S. Li, Y. Luo, M. Hong, C. Ding. "CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization." arXiv preprint arXiv:2511.01884, 2025.
Download Paper

XuanJia: A Comprehensive Virtualization-Based Code Obfuscator for Binary Protection

Published as an arXiv preprint, 2026

We present XuanJia, a comprehensive virtualization-based code obfuscator that leverages virtual machine protection to safeguard binary programs against reverse engineering and tampering attacks.

X. Zou, X. Gong, J. Zhang, S. Li, P.-C. Yew. "XuanJia: A Comprehensive Virtualization-Based Code Obfuscator for Binary Protection." arXiv preprint arXiv:2601.10261, 2026.
Download Paper

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning

Published in International Conference on Machine Learning (ICML), 2026

Accepted to ICML 2026. StitchCUDA is a multi-agent framework for end-to-end GPU program optimization using rubric-based agentic reinforcement learning, achieving ~1.72x speedup over multi-agent baselines and ~2.73x over RL model baselines on KernelBench.

S. Li, Z. Zhang, W. Chen, Y. Luo, M. Hong, C. Ding. "StitchCUDA: An Automated Multi-Agents End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning." International Conference on Machine Learning (ICML), 2026.
Download Paper | Code

GSR-GNN: Training Acceleration and Memory-Saving Framework of Deep GNNs on Circuit Graph

Published in ACM/IEEE Design Automation Conference (DAC), 2026

GSR-GNN enables training GNNs with up to hundreds of layers on circuit graphs while reducing both compute and memory overhead, achieving up to 87.2% peak memory reduction and over 30x training speedup.

Y. Luo, S. Li, Y. Feng, V. Kancharla, S. Huang, C. Ding. "GSR-GNN: Training Acceleration and Memory-Saving Framework of Deep GNNs on Circuit Graph." In Proceedings of the 63rd ACM/IEEE Design Automation Conference (DAC '26), 2026.
Download Paper
