Research
R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu
under review
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar
under review
MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training
Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, Anima Anandkumar
accepted by NeurIPS 2024
RTP: Rethinking Tensor Parallelism with Memory Deduplication
Cheng Luo, Tianle Zhong, Geoffrey Fox
arXiv 2023
Towards Efficient Deep Neural Network Training by FPGA-based Batch-level Parallelism
Cheng Luo, Man-Kit Sit, Hongxiang Fan, Shuanglong Liu, Wayne Luk, Ce Guo
FCCM 2020