Why each input's feature size in torch.nn.Transformer must be a multiple of the number of heads

2023-07-14

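Introduction

This note covers a common restriction in torch.nn.Transformer: d_model must be divisible by nhead.

In theory, Multi-Head Attention can be seen as running several Self-Attention operations on the same input and then concatenating the outputs of the heads. Intuitively, then, the feature size should not have to be a multiple of the number of heads. PyTorch's implementation, however, splits the embedding dimension evenly across the heads for efficiency, and that is where the restriction comes from.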

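Multi-Head Transformer theory

The idea behind the Multi-Head Transformer is to run several Self-Attention operations on the same input, concat their outputs, and then project the result back to the original size with a matrix.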

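In this abstract description, there is no hard relationship between the feature size of the input x and the number of heads. Looking only at the theoretical flow, training appears to be possible for any combination of feature size and head count.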
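To make the abstract description concrete, here is a minimal sketch of that "textbook" view, where every head reads the full input through its own projection. The class name IndependentHeadsAttention and the d_head parameter are hypothetical names chosen for illustration; this is not how PyTorch implements attention.

```python
import torch
import torch.nn as nn


class IndependentHeadsAttention(nn.Module):
    """Hypothetical 'textbook' multi-head self-attention (illustration only).

    Each head projects the full d_model-wide input to its own d_head-wide
    Q, K, V, so d_model and num_heads need no particular relationship.
    """

    def __init__(self, d_model: int, num_heads: int, d_head: int) -> None:
        super().__init__()
        self.d_head = d_head
        # One independent Q/K/V projection per head, each reading all d_model features.
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_head) for _ in range(num_heads)])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_head) for _ in range(num_heads)])
        self.v_proj = nn.ModuleList([nn.Linear(d_model, d_head) for _ in range(num_heads)])
        # The concatenated head outputs (num_heads * d_head) are projected back to d_model.
        self.out_proj = nn.Linear(num_heads * d_head, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        head_outputs = []
        for q_lin, k_lin, v_lin in zip(self.q_proj, self.k_proj, self.v_proj):
            q, k, v = q_lin(x), k_lin(x), v_lin(x)        # each (batch, seq_len, d_head)
            scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
            weights = torch.softmax(scores, dim=-1)       # (batch, seq_len, seq_len)
            head_outputs.append(weights @ v)              # (batch, seq_len, d_head)
        return self.out_proj(torch.cat(head_outputs, dim=-1))


# d_model=10 with 3 heads: 10 is not a multiple of 3, yet this formulation runs.
attn = IndependentHeadsAttention(d_model=10, num_heads=3, d_head=4)
x = torch.randn(2, 5, 10)
out = attn(x)
print(out.shape)  # torch.Size([2, 5, 10])
```

Because each head owns a separate projection from d_model to d_head, the only sizes that must agree are the concatenated width (num_heads * d_head) and the input width of the output projection.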

Why `d_model` must be divisible by `nhead`

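Passing an arbitrary feature size and head count to nn.Transformer can raise an error saying that embed_dim must be divisible by num_heads.

The reason is visible in the PyTorch source code. Start with nn.Transformer, where the parts involving nhead and d_model are passed on to TransformerEncoderLayer.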
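The error is easy to reproduce. The snippet below is a small sketch; try_build is a hypothetical helper, and the tiny layer counts and dim_feedforward are chosen only to keep construction cheap. In the PyTorch versions I am aware of, the check fires while the model is being constructed and surfaces as an AssertionError, but the exact exception type and message may differ between versions, so the helper catches ValueError as well.

```python
import torch.nn as nn


def try_build(d_model: int, nhead: int) -> str:
    """Return 'ok' if nn.Transformer can be constructed, else the error text."""
    try:
        nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=1,   # kept small on purpose; not relevant to the check
            num_decoder_layers=1,
            dim_feedforward=32,
        )
        return "ok"
    except (AssertionError, ValueError) as err:  # assumed exception types
        return f"{type(err).__name__}: {err}"


print(try_build(12, 3))  # 12 is a multiple of 3, so construction succeeds
print(try_build(10, 3))  # 10 is not a multiple of 3, so construction fails
```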

class Transformer(Module):
    #......
    def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
                 num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__()
        torch._C._log_api_usage_once(f"torch.nn.modules.{self.__class__.__name__}")

        if custom_encoder is not None:
            self.encoder = custom_encoder
        else:
            encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout,
                                                    activation, layer_norm_eps, batch_first, norm_first,
                                                    **factory_kwargs)
            encoder_norm = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
            self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
        ### ......

Next, TransformerEncoderLayer shows that the class actually handling attention is MultiheadAttention.

class TransformerEncoderLayer(Module):
    #......
    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__()
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,
                                            **factory_kwargs)
        #......
    #......

Following MultiheadAttention further, its forward method hands the core logic to F.multi_head_attention_forward.

class MultiheadAttention(Module):
    #.......
    def forward(
            self,
            query: Tensor,
            key: Tensor,
            value: Tensor,
            key_padding_mask: Optional[Tensor] = None,
            need_weights: bool = True,
            attn_mask: Optional[Tensor] = None,
            average_attn_weights: bool = True,
            is_causal : bool = False) -> Tuple[Tensor, Optional[Tensor]]:
        #.......
        if not self._qkv_same_embed_dim:
            attn_output, attn_output_weights = F.multi_head_attention_forward(
                query, key, value, self.embed_dim, self.num_heads,
                self.in_proj_weight, self.in_proj_bias,
                self.bias_k, self.bias_v, self.add_zero_attn,
                self.dropout, self.out_proj.weight, self.out_proj.bias,
                training=self.training,
                key_padding_mask=key_padding_mask, need_weights=need_weights,
                attn_mask=attn_mask,
                use_separate_proj_weight=True,
                q_proj_weight=self.q_proj_weight, k_proj_weight=self.k_proj_weight,
                v_proj_weight=self.v_proj_weight,
                average_attn_weights=average_attn_weights,
                is_causal=is_causal)
        else:
            attn_output, attn_output_weights = F.multi_head_attention_forward(
                query, key, value, self.embed_dim, self.num_heads,
                self.in_proj_weight, self.in_proj_bias,
                self.bias_k, self.bias_v, self.add_zero_attn,
                self.dropout, self.out_proj.weight, self.out_proj.bias,
                training=self.training,
                key_padding_mask=key_padding_mask,
                need_weights=need_weights,
                attn_mask=attn_mask,
                average_attn_weights=average_attn_weights,
                is_causal=is_causal)
        if self.batch_first and is_batched:
            return attn_output.transpose(1, 0), attn_output_weights
        else:
            return attn_output, attn_output_weights
    #......

Finally, look at F.multi_head_attention_forward. The key part is this: PyTorch first computes each head's share of the dimension as embed_dim // num_heads, then asserts that the split is exact.

def multi_head_attention_forward(
    query: Tensor,
    key: Tensor,
    value: Tensor,
    embed_dim_to_check: int,
    num_heads: int,
    #......
    ) -> Tuple[Tensor, Optional[Tensor]]:
    #......
    else:
        head_dim = embed_dim // num_heads
    assert head_dim * num_heads == embed_dim, f"embed_dim {embed_dim} not divisible by num_heads {num_heads}"
    #......
        q = q.view(bsz, num_heads, tgt_len, head_dim)
        k = k.view(bsz, num_heads, src_len, head_dim)
        v = v.view(bsz, num_heads, src_len, head_dim)

        attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
        attn_output = attn_output.permute(2, 0, 1, 3).contiguous().view(bsz * tgt_len, embed_dim)

        attn_output = linear(attn_output, out_proj_weight, out_proj_bias)
        attn_output = attn_output.view(tgt_len, bsz, attn_output.size(1))
        if not is_batched:
            # squeeze the output if input was unbatched
            attn_output = attn_output.squeeze(1)
        return attn_output, None

That is, PyTorch's implementation reshapes Q, K, and V into 4 dimensions:

batch size
number of heads
target/source length
head dimension

The original feature size is split into number of heads * head dimension. If embed_dim is not divisible by num_heads, the tensor cannot be reshaped evenly, so the assert fails immediately.
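The reshape itself can be tried in isolation. The sketch below uses a hypothetical split_heads helper and, as a simplification, a batch-first layout, so it mirrors the idea of the excerpt rather than its exact sequence of view calls.

```python
import torch


def split_heads(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Split the last dimension of x into (num_heads, head_dim).

    x: (bsz, seq_len, embed_dim) -> (bsz, num_heads, seq_len, head_dim)
    """
    bsz, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads  # floor division, as in the excerpt
    return x.view(bsz, seq_len, num_heads, head_dim).transpose(1, 2)


# embed_dim=12, num_heads=3: head_dim = 4 and 3 * 4 == 12, so the view is valid.
ok = split_heads(torch.randn(2, 5, 12), num_heads=3)
print(ok.shape)  # torch.Size([2, 3, 5, 4])

# embed_dim=10, num_heads=3: head_dim = 3 but 3 * 3 == 9 != 10, so the view
# would need 90 elements while the tensor holds 100.
error_message = None
try:
    split_heads(torch.randn(2, 5, 10), num_heads=3)
except RuntimeError as err:
    error_message = str(err)
print("reshape failed:", error_message)
```

Without the assert, the mismatch would only show up later as a less descriptive shape error from view, which is a plausible reason for checking it up front.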

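In short, the multi-head implementation in torch.nn.Transformer does not feed the full input to every head separately. It splits the feature dimension of the projected Q, K, and V evenly into several parts, has a different head compute each part, and concatenates the results at the end. This substantially reduces computation and memory use, at the cost of requiring d_model to be divisible by nhead.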
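One way to see the saving is to count parameters. The sketch below checks that the parameter count of nn.MultiheadAttention does not grow with the number of heads, and compares it with a hypothetical design in which every head gets its own full-width Q/K/V projection. The helper full_width_params and that alternative design are assumptions made for comparison only; they are not part of PyTorch.

```python
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


embed_dim = 64

# Split-head design (PyTorch): the projections are shared and then sliced,
# so the count is 4 * embed_dim**2 + 4 * embed_dim regardless of num_heads.
counts = {h: count_params(nn.MultiheadAttention(embed_dim, h)) for h in (1, 2, 4, 8)}
print(counts)  # every value is 16640


def full_width_params(embed_dim: int, num_heads: int) -> int:
    """Hypothetical design: each head has its own embed_dim -> embed_dim Q/K/V
    projections (with bias), and the output projection maps the concatenated
    num_heads * embed_dim features back to embed_dim."""
    qkv = num_heads * 3 * (embed_dim * embed_dim + embed_dim)
    out = num_heads * embed_dim * embed_dim + embed_dim
    return qkv + out


# The hypothetical full-width design grows linearly with the number of heads.
print(full_width_params(embed_dim, 8))
```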

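Summary

This example shows the gap between a theoretical description and an engineering implementation. In theory, Multi-Head Attention can be understood in fairly abstract terms, but a framework implementation adds explicit dimension constraints so that tensor reshapes, batched operations, and GPU acceleration run more efficiently.

This is also what makes reading framework source code worthwhile: besides understanding the model, you get to see which assumptions the implementation makes in exchange for performance and stability.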
