Positional Embedding in LLaMA

2023-07-20

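Introduction

These notes cover the Positional Embedding used in LLaMA, namely RoPE (Rotary Position Embedding). Positional Encoding is a detail of the Transformer that is easy to skip over yet matters a great deal: it lets the model keep track of token order even though it has none of the sequential structure of an RNN.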

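A Brief Introduction to Positional Encoding

A Transformer reads the whole input at once, unlike an RNN, which is fed one time step after another. Self-attention by itself is indifferent to token order, so position information has to be injected explicitly, for example by adding it to the input embedding.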

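For example:

I Am that I Am.

An RNN would receive these tokens one by one, in order. A Transformer sees the whole sentence at once, so without position information the two occurrences of I (and of Am) have identical embeddings, and attention has no way to tell them apart.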
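To make this concrete, here is a minimal sketch of attention without any position information. It is a toy setup, not LLaMA's attention: the 2-d embeddings are made up, the projections are the identity (Q = K = V), and the sentence is shortened to four tokens.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    # Toy single-head attention with identity projections: Q = K = V = xs.
    out = []
    for q in xs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in xs])
        out.append([sum(w * v[d] for w, v in zip(weights, xs))
                    for d in range(len(q))])
    return out

# Hypothetical 2-d embeddings for the tokens "I" and "Am".
emb = {"I": [1.0, 0.0], "Am": [0.0, 1.0]}
sentence = ["I", "Am", "I", "Am"]
shuffled = ["Am", "I", "Am", "I"]

out_a = self_attention([emb[t] for t in sentence])
out_b = self_attention([emb[t] for t in shuffled])

# Both occurrences of "I" receive the same output vector ...
same_within = out_a[0] == out_a[2]
# ... and reordering the sentence does not change what "I" sees either.
same_across = all(abs(x - y) < 1e-12 for x, y in zip(out_a[0], out_b[1]))
```

Both flags come out true: without a positional signal, the output for a token depends only on which tokens are present, not on where they are.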

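Common approaches to Positional Encoding fall into three groups:

Absolute position encoding: the most direct approach is to add the position index itself to the input embedding. When the index grows large, however, it can swamp the semantic information the embedding originally carried.
Relative position encoding: models the relative distance between tokens directly, as in Self-Attention with Relative Position Representations, which can also save part of the weight-matrix computation.
Hybrid: looks like an absolute position encoding on the surface, but after the attention inner product the result expresses a relative position relationship. Common examples are the sinusoidal position encoding from Attention Is All You Need, and RoPE.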
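The "absolute on the surface, relative after the inner product" behaviour can be checked numerically for the sinusoidal encoding. The sketch below is a simplified illustration: the function name sinusoidal, the dimension of 8, and the chosen positions are all assumptions, and it only looks at the raw encodings. In a real Transformer the encodings are added to token embeddings and pass through learned projections, so the property holds only approximately there.

```python
import math

def sinusoidal(pos, dim=8, base=10000.0):
    # Interleaved (sin, cos) pairs, one frequency per pair.
    vec = []
    for j in range(dim // 2):
        angle = pos / (base ** (2 * j / dim))
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Positions (3, 7) and (10, 14) share the same offset of 4.
near = dot(sinusoidal(3), sinusoidal(7))
far = dot(sinusoidal(10), sinusoidal(14))
# A different offset gives a different inner product.
other = dot(sinusoidal(3), sinusoidal(4))
```

Since sin(a)sin(b) + cos(a)cos(b) = cos(a - b), near and far agree up to floating-point error, while other differs.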

What sets RoPE apart is that it encodes position as a rotation in the complex plane, and this is the method LLaMA adopts.

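Positional Embedding Used in LLaMA

The Positional Embedding used in LLaMA is RoPE. It can be seen as a method that combines absolute and relative position information; if it has to be put in one category, it is closest to "achieving a relative-position effect with an absolute position encoding".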

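The core form of RoPE, written for a single two-dimensional pair viewed as one complex number, is:

$$\langle q_m e^{im\theta},\ k_n e^{in\theta} \rangle = \mathrm{Re}\left[ q_m e^{im\theta} \left( k_n e^{in\theta} \right)^{*} \right] = \mathrm{Re}\left[ q_m k_n^{*}\, e^{i(m-n)\theta} \right]$$

The inner product of two complex numbers, treated as 2-d real vectors, can be computed by conjugating one of them, multiplying, and taking the real part. When a complex number is multiplied by a conjugate, the exponents subtract. RoPE can therefore put the absolute positions $m$ and $n$ into the Euler form, and the attention inner product leaves only the relative position $m - n$.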
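The conjugate-product argument above can be verified with a few lines of standard-library Python. This is a sketch for a single 2-d pair; the values of theta, q, and k are arbitrary assumptions chosen for illustration.

```python
import cmath

theta = 0.3               # assumed rotation frequency for one 2-d pair
q = complex(1.0, 2.0)     # hypothetical query, viewed as one complex number
k = complex(0.5, -1.5)    # hypothetical key

def score(m, n):
    # Real inner product of the two rotated vectors:
    # Re[(q e^{i m theta}) * conj(k e^{i n theta})]
    q_rot = q * cmath.exp(1j * m * theta)
    k_rot = k * cmath.exp(1j * n * theta)
    return (q_rot * k_rot.conjugate()).real

s1 = score(5, 2)        # offset 3
s2 = score(105, 102)    # same offset 3, very different absolute positions
s3 = score(5, 3)        # offset 2
```

s1 and s2 match up to floating-point error even though the absolute positions differ by 100, while s3 differs because the offset changed.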

RoPE Code Analysis

The code below comes from facebookresearch/llama. The three functions that matter most for RoPE are precompute_freqs_cis, reshape_for_broadcast, and apply_rotary_emb.

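#......
from typing import Tuple  # imports needed to run this excerpt on its own

import torch


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # One frequency per pair of dimensions: theta ** (-2j / dim), shape (dim // 2,)
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # Position indices 0 .. end - 1
    t = torch.arange(end, device=freqs.device)  # type: ignore
    # Angle for every (position, frequency) combination, shape (end, dim // 2)
    freqs = torch.outer(t, freqs).float()  # type: ignore
    # Unit-modulus complex numbers e^{i * angle}
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis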


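def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    # freqs_cis must match x on axis 1 and on the last axis
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    # Keep axis 1 and the last axis, set the rest to 1 so the view broadcasts over x
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)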


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
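    # Pair up the last dimension and view each pair as one complex number
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    # Complex multiplication is the rotation; flatten(3) merges the (pairs, 2) axes back
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    # Cast back to the dtype of the inputs
    return xq_out.type_as(xq), xk_out.type_as(xk)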
#......

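precompute_freqs_cis precomputes the Euler form for every position, that is, $e^{im\theta}$ and $e^{in\theta}$. More precisely, it returns a table of shape (end, dim // 2) whose entry for position $m$ and pair $j$ is $e^{im\theta_j}$, with $\theta_j = \text{theta}^{-2j/\text{dim}}$.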
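For readers without PyTorch at hand, the same table can be sketched with the standard library. The function name precompute_freqs_cis_py is hypothetical, and nested lists stand in for the complex64 tensor, so this is an illustration of the arithmetic rather than a drop-in replacement.

```python
import cmath

def precompute_freqs_cis_py(dim, end, theta=10000.0):
    # One frequency per pair of dimensions: theta ** (-j / dim) for j = 0, 2, 4, ...
    freqs = [1.0 / (theta ** (j / dim)) for j in range(0, dim, 2)]
    # Row m holds e^{i * m * freq_j}: complex numbers of modulus 1.
    return [[cmath.rect(1.0, m * f) for f in freqs] for m in range(end)]

table = precompute_freqs_cis_py(dim=8, end=4)
rows, cols = len(table), len(table[0])
```

Row 0 is all ones (no rotation at position 0), and every entry has modulus 1, so multiplying by it rotates a pair without changing its length.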

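apply_rotary_emb then computes $q_m e^{im\theta}$ and $k_n e^{in\theta}$ and converts the results back to a real representation. Writing $q_m = a + bi$, the rotated pair is $[a\cos(m\theta) - b\sin(m\theta),\ a\sin(m\theta) + b\cos(m\theta)]$, and likewise for $k_n$ with angle $n\theta$. The attention inner product that follows uses these Q and K, which now carry the rotary position information.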
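The effect on one vector can be sketched in plain Python. Multiplying a pair (a, b), read as a + bi, by e^{i * angle} is an ordinary 2-d rotation. The helper rotate_pairs and the sample vector are hypothetical; the sketch handles a single flat vector rather than the batched 4-d tensors of the original.

```python
import cmath
import math

def rotate_pairs(x, pos, theta=10000.0):
    # x is a flat real vector of even length; (x[0], x[1]), (x[2], x[3]), ...
    # are each treated as one complex number.
    dim = len(x)
    out = []
    for j in range(0, dim, 2):
        angle = pos / (theta ** (j / dim))
        z = complex(x[j], x[j + 1]) * cmath.rect(1.0, angle)
        out += [z.real, z.imag]
    return out

x = [1.0, 2.0, 3.0, 4.0]   # hypothetical head vector with dim = 4
y = rotate_pairs(x, pos=5)

# The first pair uses frequency theta ** 0 = 1, so it rotates by 5 radians.
a, b = x[0], x[1]
expected_first = [a * math.cos(5) - b * math.sin(5),
                  a * math.sin(5) + b * math.cos(5)]
norm_x = math.sqrt(sum(v * v for v in x))
norm_y = math.sqrt(sum(v * v for v in y))
```

The first two entries of y match the rotation formula, and the vector norm is unchanged, which is what one expects from a pure rotation.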

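reshape_for_broadcast reshapes freqs_cis so that it can broadcast against xq_ and xk_: it keeps axis 1 and the last axis and sets every other axis to 1, which lets the tensors be multiplied elementwise.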
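The shape rule is easy to see on plain tuples. The sketch below reuses the list comprehension from reshape_for_broadcast; the helper name broadcast_shape and the concrete sizes are assumptions, as is the layout (batch, seqlen, n_heads, head_dim // 2) for xq_.

```python
def broadcast_shape(x_shape):
    # Same rule as reshape_for_broadcast: keep axis 1 (sequence) and the last
    # axis (complex pairs), and set every other axis to 1.
    ndim = len(x_shape)
    return [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x_shape)]

# Assumed layout of xq_: (batch, seqlen, n_heads, head_dim // 2).
xq_shape = (2, 16, 8, 64)
freqs_shape = broadcast_shape(xq_shape)
```

With these sizes freqs_shape is [1, 16, 1, 64], so the same rotation table is shared across the batch and head axes.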

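Summary

What makes RoPE interesting is that it hides the position information in a complex rotation, and the attention inner product then leaves the relative position behind on its own. The design borrows neatly from signal processing for use in deep learning, and it shows that Positional Encoding can be more than "just adding a position vector".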
