# Awesome_LLM_System-PaperList

Since the emergence of ChatGPT in 2022, accelerating Large Language Models (LLMs) has become increasingly important. Here is a list of papers on LLM inference and serving.

## Survey

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  | Hardware and software co-design | UCB | Arxiv | |
|  | Transformer optimization | Iowa State University | Journal of Systems Architecture | |
|  | Model Compression | UCSD | Arxiv | |
|  | Optimization technique: quant, pruning, continuous batching, virtual memory | CMU | Arxiv | |
|  | Performance analysis | Infinigence-AI | Arxiv | |

## Framework

| Paper/OpenSource Project | Keywords | Institute (first) | Publication | Others |
|--------------------------|----------|-------------------|-------------|--------|
|  | DeepSpeed; Kernel Fusion | Microsoft | SC 2022 | |
|  | DeepSpeed; Split fuse | Microsoft | Arxiv | |
|  | vLLM; PagedAttention | UCB | SOSP 2023 | |
|  |  | NVIDIA |  | |
|  |  | Shanghai Artificial Intelligence Laboratory |  | |
|  | TVM; Multi-platforms | MLC-Team |  | |
|  |  | Huggingface |  | |

## Serving

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  | Distributed inference serving | PKU | Arxiv | |
|  | Pipeline Parallel; Auto parallel | UCB | OSDI 2023 | |
|  | Continuous batching | Seoul National University | OSDI 2022 | |
|  | Multiple Decoding Heads | Princeton University | Arxiv | |
|  | Consumer-grade GPU | SJTU | Arxiv | |
|  | Flash; Pruning | Apple | Arxiv | |
|  | Length Perception | NUS | NeurIPS 2023 | |
|  |  | Harvard University | Arxiv | |
|  | Decouple | PKU | OSDI 2024 | |
|  | Decouple | UW | OSDI 2024 | |
|  | Agent Language | UCB | Arxiv | |
|  | Single GPU | Stanford University | Arxiv | |
|  | Decouple | GaTech | OSDI 2024 | |
|  | Preemptible GPU | CMU | ASPLOS 2024 | |
|  | Tree-based Speculative Decoding | CMU | ASPLOS 2024 | |
|  | Cache the multi-turn prefill KV-cache in host DRAM and SSD | NUS | Arxiv | |
|  | Spatial-temporal multiplexing to serve multiple LLMs | MMLab | Arxiv | |

## Operating System

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  | OS; LLM Agent | Rutgers University | Arxiv | |

## Transformer Acceleration

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  |  | Tencent | PPoPP 2021 | |
|  | FlashAttention; Online Softmax | Stanford University | NeurIPS 2023 | |
|  |  | Stanford University | Arxiv | |
|  | Softmax with Unified Maximum Value | Tsinghua University | MLSys 2024 | |
|  | FFT; TensorCore; Long Sentences | Stanford University | Arxiv | |
|  |  | Georgia Institute of Technology | ASPLOS 2023 | |
|  | Variable-Length Inputs | UCR | PPoPP 2022 | |
|  | MQA | Google | Arxiv | |
|  | GQA | Google Research | ACL 2023 | |
|  |  | ByteDance | NAACL 2021 | |
|  |  | ByteDance | SC 2022 | |
|  | Blockwise transformer | UCB | NeurIPS 2023 | |
|  | Dynamic Memory Management | Microsoft Research India | Arxiv | |

## Model Compression

### Quant

### Pruning/Sparsity

## Communication

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  | Overlap | Google | ASPLOS 2023 | |
|  | Scaling | Google | MLSys 2023 | |
| Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning | Communication partition | PKU | ASPLOS 2024 | |

## Energy

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  |  | Yale University | NSDI 2023 | |

## Decentralized

## Serverless

| Paper | Keywords | Institute (first) | Publication | Others |
|-------|----------|-------------------|-------------|--------|
|  |  | The University of Edinburgh | Arxiv | |
