FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness