# Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, accelerating Large Language Model inference has become increasingly important. Here is a list of papers on LLM inference and serving.
## Survey
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | Hardware and software co-design | UCB | Arxiv | |
| | Transformer optimization | Iowa State University | Journal of Systems Architecture | |
| | Model Compression | UCSD | Arxiv | |
| | Optimization techniques: quantization, pruning, continuous batching, virtual memory | CMU | Arxiv | |
| | Performance analysis | Infinigence-AI | Arxiv | |
## Framework
| Paper/Open-Source Project | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | DeepSpeed; Kernel Fusion | Microsoft | SC 2022 | |
| | DeepSpeed; Split fuse | Microsoft | Arxiv | |
| | vLLM; PagedAttention | UCB | SOSP 2023 | |
| | | NVIDIA | | |
| | | Shanghai Artificial Intelligence Laboratory | | |
| | TVM; Multi-platform | MLC-Team | | |
| | | Huggingface | | |
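For context on the vLLM/PagedAttention entry above: the key idea is to manage the KV cache in fixed-size blocks through a per-sequence block table, allocating physical blocks on demand instead of reserving memory for the maximum sequence length. A minimal bookkeeping sketch (all names and sizes here are illustrative, not vLLM's actual API):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockAllocator:
    """Pool of physical KV-cache blocks, handed out on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    """One request; its block table maps logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1
```

With this scheme, a sequence of 17 tokens holds only 2 blocks of cache, and blocks freed by finished sequences can be reused by others immediately.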
## Serving
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | Distributed inference serving | PKU | Arxiv | |
| | Pipeline Parallel; Auto parallel | UCB | OSDI 2023 | |
| | Continuous batching | Seoul National University | OSDI 2022 | |
| | Multiple Decoding Heads | Princeton University | Arxiv | |
| | Consumer-grade GPU | SJTU | Arxiv | |
| | Flash; Pruning | Apple | Arxiv | |
| | Length Perception | NUS | NeurIPS 2023 | |
| | | Harvard University | Arxiv | |
| | Decouple | PKU | OSDI 2024 | |
| | Decouple | UW | OSDI 2024 | |
| | Agent Language | UCB | Arxiv | |
| | Single GPU | Stanford University | Arxiv | |
| | Decouple | GaTech | OSDI 2024 | |
| | Preemptible GPU | CMU | ASPLOS 2024 | |
| | Tree-based Speculative Decoding | CMU | ASPLOS 2024 | |
| | Multi-turn prefill KV-cache cached in host DRAM and SSD | NUS | Arxiv | |
| | Spatial-temporal multiplexing to serve multiple LLMs | MMLab | Arxiv | |
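The "Continuous batching" keyword (the OSDI 2022 entry above, i.e. Orca-style iteration-level scheduling) can be illustrated with a toy loop: after every decoding step, finished sequences leave the batch and waiting requests join immediately, instead of the whole batch draining before new work is admitted. A minimal sketch; all names and numbers are illustrative:

```python
from collections import deque

def serve(requests, max_batch=4, max_steps=1000):
    """requests: iterable of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = []   # in-flight sequences: [request_id, tokens_remaining]
    finished = []
    for _ in range(max_steps):
        # Admit new requests at iteration granularity, not batch granularity.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running.append([rid, n])
        if not running:
            break
        # One decoding step for every running sequence.
        for seq in running:
            seq[1] -= 1
        # Retire finished sequences immediately, freeing their batch slots.
        finished += [rid for rid, remaining in running if remaining == 0]
        running = [seq for seq in running if seq[1] > 0]
    return finished
```

Because slots are recycled per iteration, a short request (e.g. one generating a single token) completes and returns without waiting for the longest sequence in its batch.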
## Operating System
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | OS; LLM Agent | Rutgers University | Arxiv | |
## Transformer Acceleration
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | | Tencent | PPoPP 2021 | |
| | FlashAttention; Online Softmax | Stanford University | NeurIPS 2023 | |
| | | Stanford University | Arxiv | |
| | Softmax with Unified Maximum Value | Tsinghua University | MLSys 2024 | |
| | FFT; TensorCore; Long Sequences | Stanford University | Arxiv | |
| | | Georgia Institute of Technology | ASPLOS 2023 | |
| | Variable-Length Inputs | UCR | PPoPP 2022 | |
| | MQA | | Arxiv | |
| | GQA | Google Research | ACL 2023 | |
| | | ByteDance | NAACL 2021 | |
| | | ByteDance | SC 2022 | |
| | Blockwise Transformer | UCB | NeurIPS 2023 | |
| | Dynamic Memory Management | Microsoft Research India | Arxiv | |
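The "FlashAttention; Online Softmax" entry above builds on the online softmax trick: softmax can be computed in a single streaming pass by carrying a running maximum and a rescaled running sum, which is what lets FlashAttention process attention scores tile by tile without materializing the full score row. A minimal scalar sketch (illustrative, not the fused GPU kernel):

```python
import math

def online_softmax(xs):
    """One-pass, numerically stable softmax over a stream of scores."""
    m = float("-inf")  # running maximum seen so far
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new maximum before adding the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

The rescaling step is the crux: partial sums computed under a stale maximum stay valid after multiplication by `exp(m_old - m_new)`, so tiles can be processed independently and merged.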
## Model Compression
### Quantization
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | | SJTU | Arxiv | |
| | Dynamic Compression | NVIDIA | Arxiv | |
### Pruning/Sparsity
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | | University of Sydney | VLDB 2024 | |
| | Consistency | Shanghai Jiao Tong University | Arxiv | |
| | Disaggregating Prefill and Decoding | PKU | Arxiv | |
## Communication
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | Overlap | | ASPLOS 2023 | |
| | Scaling | | MLSys 2023 | |
| Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication | Communication partition | PKU | ASPLOS 2024 | |
## Energy
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | | Yale University | NSDI 2023 | |
## Decentralized
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | Consumer-grade GPU | HKBU | Arxiv | |
| | | Yandex | Arxiv | |
## Serverless
| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| | | The University of Edinburgh | Arxiv | |