# Awesome_LLM_System-PaperList

Since the emergence of ChatGPT in 2022, accelerating Large Language Models (LLMs) has become increasingly important. Here is a list of papers on LLM inference and serving.

## Survey

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  | Hardware and software co-design | UCB | Arxiv | |
|  | Transformer optimization | Iowa State University | Journal of Systems Architecture | |
|  | Model Compression | UCSD | Arxiv | |
|  | Optimization technique: quant, pruning, continuous batching, virtual memory | CMU | Arxiv | |
|  | Performance analysis | Infinigence-AI | Arxiv | |

## Framework

| Paper/OpenSource Project | Keywords | Institute (first) | Publication | Others |
|--------------------------|----------|-------------------|-------------|--------|
|  | DeepSpeed; Kernel Fusion | Microsoft | SC 2022 | |
|  | DeepSpeed; Split fuse | Microsoft | Arxiv | |
|  | vLLM; PagedAttention | UCB | SOSP 2023 | |
|  |  | NVIDIA |  | |
|  |  | Shanghai Artificial Intelligence Laboratory |  | |
|  | TVM; Multi-platforms | MLC-Team |  | |
|  |  | Huggingface |  | |

## Serving

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  | Distributed inference serving | PKU | Arxiv | |
|  | Pipeline Parallel; Auto parallel | UCB | OSDI 2023 | |
|  | Continuous batching | Seoul National University | OSDI 2022 | |
|  | Multiple Decoding Heads | Princeton University | Arxiv | |
|  | Consumer-grade GPU | SJTU | Arxiv | |
|  | Flash; Pruning | Apple | Arxiv | |
|  | Length Perception | NUS | NeurIPS 2023 | |
|  |  | Harvard University | Arxiv | |
|  | Decouple | PKU | OSDI 2024 | |
|  | Decouple | UW | OSDI 2024 | |
|  | Agent Language | UCB | Arxiv | |
|  | Single GPU | Stanford University | Arxiv | |
|  | Decouple | GaTech | OSDI 2024 | |
|  | Preemptible GPU | CMU | ASPLOS 2024 | |
|  | Tree-based Speculative Decoding | CMU | ASPLOS 2024 | |
|  | Cache the multi-turn prefill KV-cache in host DRAM and SSD | NUS | Arxiv | |
|  | Spatial-temporal multiplexing to serve multiple LLMs | MMLab | Arxiv | |

## Operating System

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  | OS; LLM Agent | Rutgers University | Arxiv | |

## Transformer Acceleration

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  |  | Tencent | PPoPP 2021 | |
|  | FlashAttention; Online Softmax | Stanford University | NeurIPS 2023 | |
|  |  | Stanford University | Arxiv | |
|  | Softmax with Unified Maximum Value | Tsinghua University | MLSys 2024 | |
|  | FFT; TensorCore; Long Sentences | Stanford University | Arxiv | |
|  |  | Georgia Institute of Technology | ASPLOS 2023 | |
|  | Variable-Length Inputs | UCR | PPoPP 2022 | |
|  | MQA | Google | Arxiv | |
|  | GQA | Google Research | ACL 2023 | |
|  |  | ByteDance | NAACL 2021 | |
|  |  | ByteDance | SC 2022 | |
|  | Blockwise transformer | UCB | NeurIPS 2023 | |
|  | Dynamic Memory Management | Microsoft Research India | Arxiv | |

## Model Compression

### Quant

### Pruning/Sparsity

## Communication

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  | Overlap | Google | ASPLOS 2023 | |
|  | Scaling | Google | MLSys 2023 | |
| Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning | Communication partition | PKU | ASPLOS 2024 | |

## Energy

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  |  | Yale University | NSDI 2023 | |

## Decentralized

## Serverless

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  |  | The University of Edinburgh | Arxiv | |
