Flash Attention

Flash attention is a fast, memory-efficient algorithm that speeds up the attention mechanism in transformer models by minimizing data transfers between the levels of GPU memory: the large but slower high-bandwidth memory (HBM) and the small but fast on-chip SRAM.

How Flash Attention works

Why it is important

What is HBM? (High-Bandwidth Memory)

Flash attention is a memory-efficient algorithm that speeds up attention in transformer models by never storing the full attention score matrix, whose size grows quadratically with sequence length. HBM (High Bandwidth Memory) is a type of high-performance memory built from DRAM dies stacked in a 3D architecture to achieve very high bandwidth. It serves as the GPU's main memory, but it is still much slower to access than on-chip SRAM. Flash attention uses a technique called tiling to process attention in small blocks, keeping each block in fast on-chip memory (SRAM) and minimizing slow reads and writes to HBM.
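
The tiling idea described above can be sketched in plain NumPy. This is only an illustration of the arithmetic, not the actual GPU kernel: NumPy has no notion of SRAM or HBM, the function names and the block_size parameter are assumptions made for this example, and for simplicity only the keys and values are split into blocks (a real kernel would also tile the queries). The sketch keeps a running row maximum and a running softmax denominator so that the full score matrix is never built at once.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference attention: materializes the full (n_q x n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=4):
    """Illustrative tiled attention with a running (online) softmax.

    Only one (n_q x block_size) slice of scores exists at any time.
    block_size is a hypothetical tuning knob for this sketch.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[1]))      # unnormalized output accumulator
    row_max = np.full((n_q, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n_q, 1))           # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        s = (q @ k_blk.T) * scale          # scores for this block only
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))

        # Rescale what was accumulated so far to the new running max.
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max)

        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum
```

Calling tiled_attention(q, k, v) on small random arrays should give the same result as naive_attention(q, k, v) up to floating-point rounding, which shows that processing block by block changes the memory access pattern but not the answer.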