<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <author>
    <name>張君實</name>
  </author>
  <generator uri="https://hexo.io/">Hexo</generator>
  <id>https://www.threemonth03.com/</id>
  <link href="https://www.threemonth03.com/" rel="alternate"/>
  <link href="https://www.threemonth03.com/atom.xml" rel="self"/>
  <rights>All rights reserved 2026, 張君實</rights>
  <subtitle>技術筆記與研究紀錄</subtitle>
  <title>張君實</title>
  <updated>2026-05-10T14:11:04.358Z</updated>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="機器學習" scheme="https://www.threemonth03.com/categories/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92/"/>
    <category term="NNI" scheme="https://www.threemonth03.com/tags/NNI/"/>
    <category term="Hyperparameter Tuning" scheme="https://www.threemonth03.com/tags/Hyperparameter-Tuning/"/>
    <category term="CIFAR10" scheme="https://www.threemonth03.com/tags/CIFAR10/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理如何使用 NNI（Neural Network Intelligence）做超參數搜尋。範例任務是用 CNN 訓練 CIFAR-10，並透過 NNI 搜尋 learning rate、momentum 與 batch size。</p><h2 id="作業檔案簡介"><a href="#作業檔案簡介" class="headerlink" title="作業檔案簡介"></a>作業檔案簡介</h2><p>程式碼放在 <a href="https://github.com/ThreeMonth03/hyperparameter_tuning">ThreeMonth03&#x2F;hyperparameter_tuning</a>。</p><p>主要目錄如下：</p><ul><li><code>config/</code>：放 <code>requirement.txt</code>。</li><li><code>src/</code>：放 source code，包含 <code>cnn.py</code> 與 <code>nni_search.py</code>。</li><li><code>log/</code>：放 NNI experiment log，可以回放歷史 training 紀錄。</li></ul><h2 id="如何從頭復現-NNI-Training"><a href="#如何從頭復現-NNI-Training" class="headerlink" title="如何從頭復現 NNI Training"></a>如何從頭復現 NNI Training</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/ThreeMonth03/hyperparameter_tuning.git</span><br><span class="line"><span class="built_in">cd</span> hyperparameter_tuning</span><br><span class="line">docker-compose up</span><br></pre></td></tr></table></figure><p>接著在瀏覽器打開：</p><figure class="highlight text"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">http://localhost:[your_port]</span><br></pre></td></tr></table></figure><p>這裡建議不要使用 <code>docker-compose up -d</code>，否則 experiment log 可能不會被正常保存。實際部署時，也記得依照環境調整 port、container name 與 image name。</p><h2 id="如何直接看-Training-Log"><a href="#如何直接看-Training-Log" class="headerlink" title="如何直接看 Training Log"></a>如何直接看 Training Log</h2><p>如果只想查看既有 log，可以改用 <code>nni_search.py</code> 裡的 <code>experiment.view</code>：</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td 
class="code"><pre><span class="line">experiment.view(experiment_id, port=<span class="number">8323</span>, non_blocking=<span class="literal">False</span>)</span><br></pre></td></tr></table></figure><p>操作流程：</p><ol><li>Clone repo。</li><li>依照 <code>nni_search.py</code> 內的註解，關閉 training 設定，打開 <code>experiment.view(...)</code>。</li><li>執行 <code>docker-compose up</code>。</li><li>到 <code>localhost:[your_port]</code> 查看結果。</li></ol><h2 id="實驗設定"><a href="#實驗設定" class="headerlink" title="實驗設定"></a>實驗設定</h2><table><thead><tr><th>Hyperparameter</th><th>Search Space</th></tr></thead><tbody><tr><td><code>lr</code></td><td><code>0.0001 ~ 0.1</code>，log uniform</td></tr><tr><td><code>momentum</code></td><td><code>0 ~ 1</code>，uniform</td></tr><tr><td><code>batch_size</code></td><td><code>4</code>、<code>8</code>、<code>16</code></td></tr><tr><td>Tuner</td><td>TPE</td></tr></tbody></table><h3 id="Result"><a href="#Result" class="headerlink" title="Result"></a>Result</h3><p>Best hyperparameter：</p><ul><li><code>lr</code>: <code>0.0024724673142795927</code></li><li><code>momentum</code>: <code>0.31344560117709097</code></li><li><code>batch_size</code>: <code>8</code></li></ul><p>Test Accuracy：<code>65%</code></p><img src="https://i.imgur.com/o8f06cB.png"> <img src="https://i.imgur.com/JzeBAuD.png"> <h2 id="筆記"><a href="#筆記" class="headerlink" title="筆記"></a>筆記</h2><h3 id="如何用-Python-API-調-Hyperparameter"><a href="#如何用-Python-API-調-Hyperparameter" class="headerlink" title="如何用 Python API 調 Hyperparameter"></a>如何用 Python API 調 Hyperparameter</h3><p>NNI 可以透過 terminal 指令或 Python API 控制 hyperparameter。以下是透過 Python API 設定 search space 與 experiment 的範例。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span 
class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nni_search.py</span></span><br><span class="line">search_space = &#123;</span><br><span class="line">    <span class="string">&#x27;lr&#x27;</span>: &#123;<span class="string">&#x27;_type&#x27;</span>: <span class="string">&#x27;loguniform&#x27;</span>, <span class="string">&#x27;_value&#x27;</span>: [<span class="number">0.0001</span>, <span class="number">0.1</span>]&#125;,</span><br><span class="line">    <span class="string">&#x27;momentum&#x27;</span>: &#123;<span class="string">&#x27;_type&#x27;</span>: <span class="string">&#x27;uniform&#x27;</span>, <span class="string">&#x27;_value&#x27;</span>: [<span class="number">0</span>, <span class="number">1</span>]&#125;,</span><br><span class="line">    <span class="string">&#x27;batch_size&#x27;</span>: &#123;<span class="string">&quot;_type&quot;</span>: <span class="string">&quot;choice&quot;</span>, <span class="string">&quot;_value&quot;</span>: [<span class="number">4</span>, <span class="number">8</span>, <span class="number">16</span>]&#125;,</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> nni</span><br><span class="line"><span 
class="keyword">from</span> nni.experiment <span class="keyword">import</span> Experiment</span><br><span class="line"></span><br><span class="line">experiment = Experiment(<span class="string">&#x27;local&#x27;</span>)</span><br><span class="line">experiment.config.trial_command = <span class="string">&#x27;python src/cnn.py&#x27;</span></span><br><span class="line">experiment.config.trial_code_directory = <span class="string">&#x27;.&#x27;</span></span><br><span class="line">experiment.config.search_space = search_space</span><br><span class="line">experiment.config.tuner.name = <span class="string">&#x27;TPE&#x27;</span></span><br><span class="line">experiment.config.tuner.class_args[<span class="string">&#x27;optimize_mode&#x27;</span>] = <span class="string">&#x27;maximize&#x27;</span></span><br><span class="line">experiment.config.max_trial_number = <span class="number">50</span></span><br><span class="line">experiment.config.trial_concurrency = <span class="number">10</span></span><br><span class="line">experiment.config.trial_gpu_number = <span class="number">3</span></span><br><span class="line">experiment.config.debug = <span class="literal">True</span></span><br><span class="line">experiment.config.experiment_working_directory = <span class="string">&#x27;./log&#x27;</span></span><br><span class="line">experiment.config.training_service.use_active_gpu = <span class="literal">True</span></span><br><span class="line">experiment.config.training_service.max_trial_number_per_gpu = <span class="number">10</span></span><br><span class="line"></span><br><span class="line">experiment.run(<span class="number">8323</span>)</span><br><span class="line"><span class="built_in">print</span>(experiment.get_status())</span><br><span class="line"><span class="built_in">print</span>(experiment.get_job_statistics())</span><br><span class="line"><span class="built_in">print</span>(experiment.list_trial_jobs())</span><br><span class="line"></span><br><span class="line"><span 
class="built_in">input</span>(<span class="string">&#x27;Press enter to quit&#x27;</span>)</span><br><span class="line">experiment.stop()</span><br></pre></td></tr></table></figure><p>被控制的 model 也要加入 NNI 參數讀取與回報結果的邏輯。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># cnn.py</span></span><br><span class="line"><span class="keyword">import</span> nni</span><br><span class="line"><span class="comment">#......</span></span><br><span class="line">params = &#123;</span><br><span class="line">    <span class="string">&#x27;lr&#x27;</span>: <span class="number">0.001</span>,</span><br><span class="line">    <span class="string">&#x27;momentum&#x27;</span>: <span class="number">0</span>,</span><br><span class="line">    <span class="string">&#x27;batch_size&#x27;</span>: <span class="number">4</span>,</span><br><span class="line">&#125;</span><br><span class="line">optimized_params = nni.get_next_parameter()</span><br><span class="line">params.update(optimized_params)</span><br><span class="line"><span 
class="built_in">print</span>(params)</span><br><span class="line"><span class="comment">#......</span></span><br><span class="line">epochs = <span class="number">20</span></span><br><span class="line">batch_size = params[<span class="string">&#x27;batch_size&#x27;</span>]</span><br><span class="line">lr = params[<span class="string">&#x27;lr&#x27;</span>]</span><br><span class="line">momentum = params[<span class="string">&#x27;momentum&#x27;</span>]</span><br><span class="line"><span class="comment">#......</span></span><br><span class="line"><span class="keyword">with</span> torch.no_grad():</span><br><span class="line">    <span class="keyword">for</span> data <span class="keyword">in</span> testloader:</span><br><span class="line">        images, labels = data[<span class="number">0</span>].to(device), data[<span class="number">1</span>].to(device)</span><br><span class="line">        outputs = net(images)</span><br><span class="line">        _, predicted = torch.<span class="built_in">max</span>(outputs.data, <span class="number">1</span>)</span><br><span class="line">        total += labels.size(<span class="number">0</span>)</span><br><span class="line">        correct += (predicted == labels).<span class="built_in">sum</span>().item()</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span>(<span class="string">f&#x27;Accuracy of the network on the 10000 test images: <span class="subst">&#123;<span class="number">100</span> * correct // total&#125;</span> %&#x27;</span>)</span><br><span class="line">nni.report_final_result(<span class="number">100</span> * correct // total)</span><br></pre></td></tr></table></figure><h2 id="小結"><a href="#小結" class="headerlink" title="小結"></a>小結</h2><p>NNI 的好處是可以把「手動反覆調參」變成可重現的實驗流程。只要把 search space、tuner 與 training script 接好，就能自動化比較不同超參數組合，並保留 experiment log 供後續分析。</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ul><li><a 
href="https://nni.readthedocs.io/en/stable/">NNI Documentation</a></li><li><a href="https://github.com/microsoft/nni">microsoft&#x2F;nni</a></li></ul>]]>
    </content>
    <id>https://www.threemonth03.com/2023/08/27/2023-08-27-%E4%BD%BF%E7%94%A8nni%E6%89%BE%E6%9C%80%E4%BD%B3%E8%B6%85%E5%8F%83%E6%95%B8/</id>
    <link href="https://www.threemonth03.com/2023/08/27/2023-08-27-%E4%BD%BF%E7%94%A8nni%E6%89%BE%E6%9C%80%E4%BD%B3%E8%B6%85%E5%8F%83%E6%95%B8/"/>
    <published>2023-08-27T02:13:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理如何使用 NNI（Neural Network Intelligence）做超]]>
    </summary>
    <title>使用nni尋找最佳超參數</title>
    <updated>2026-05-10T14:11:04.358Z</updated>
  </entry>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="開發環境" scheme="https://www.threemonth03.com/categories/%E9%96%8B%E7%99%BC%E7%92%B0%E5%A2%83/"/>
    <category term="Docker" scheme="https://www.threemonth03.com/tags/Docker/"/>
    <category term="Jupyter" scheme="https://www.threemonth03.com/tags/Jupyter/"/>
    <category term="TensorBoard" scheme="https://www.threemonth03.com/tags/TensorBoard/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理如何用 Docker Compose 建立 Jupyter 與 TensorBoard 環境，並透過 local forwarding 在本機瀏覽器使用遠端服務。</p><p>範例 repo 放在 <a href="https://github.com/ThreeMonth03/Docker_example">ThreeMonth03&#x2F;Docker_example</a>。</p><h2 id="專案結構"><a href="#專案結構" class="headerlink" title="專案結構"></a>專案結構</h2><p>這個 repository 裡有幾個重點：</p><ul><li><code>jupyter/</code>：Jupyter 服務的 Dockerfile。</li><li><code>tensorboard/</code>：TensorBoard 服務的 Dockerfile。</li><li><code>docker-compose.yml</code>：管理兩個 image 與 container。</li><li><code>main.ipynb</code>、<code>logs/</code>：用來驗證 Jupyter 與 TensorBoard 是否正常。</li></ul><h2 id="執行流程"><a href="#執行流程" class="headerlink" title="執行流程"></a>執行流程</h2><p>如果服務跑在遠端機器上，可以先透過 SSH local forwarding 把 port 轉回本機。假設要轉兩個服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ssh -L localhost:8323:localhost:8323 [Account]@[Server IP]</span><br><span class="line">ssh -L localhost:8324:localhost:8324 [Account]@[Server IP]</span><br></pre></td></tr></table></figure><p>接著在遠端機器 clone repo，並啟動服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/ThreeMonth03/Docker_example.git</span><br><span class="line"><span class="built_in">cd</span> Docker_example</span><br><span class="line">docker-compose up -d</span><br></pre></td></tr></table></figure><p>啟動後，在本機瀏覽器打開：</p><figure class="highlight text"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">http://localhost:8323/</span><br><span 
class="line">http://localhost:8324/</span><br></pre></td></tr></table></figure><p>使用完後關閉服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose down</span><br></pre></td></tr></table></figure><p>如果曾經修改過 <code>docker-compose.yml</code>，導致出現 orphan container，可以改用：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose down --remove-orphans</span><br></pre></td></tr></table></figure><h2 id="Local-Forwarding"><a href="#Local-Forwarding" class="headerlink" title="Local Forwarding"></a>Local Forwarding</h2><p>有時候遠端 server 的特定 port 不會直接對外開放。這時可以透過 SSH local forwarding，讓本機 port 對應到遠端 server 的 port。</p><img src="https://johnliu55.tw/ssh-tunnel/images/local_scenario1_problem.png" alt="防火牆" title="防火牆"> <p>基本格式如下：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ssh -L localhost:[your_computer_port]:localhost:[server_port] [Account]@[Server IP]</span><br></pre></td></tr></table></figure><img src="https://johnliu55.tw/ssh-tunnel/images/local_scenario1_solved.png" alt="Local Forwarding" title="Local Forwarding"> <p>例如帳號是 <code>threemonth</code>，server IP 是 <code>123.456.78.901</code>，要把本機 <code>9090</code> 對到遠端 <code>8080</code>：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ssh -L localhost:9090:localhost:8080 threemonth@123.456.78.901</span><br></pre></td></tr></table></figure><h2 id="Jupyter-與-TensorBoard-指令"><a href="#Jupyter-與-TensorBoard-指令" class="headerlink" title="Jupyter 與 TensorBoard 指令"></a>Jupyter 與 TensorBoard 指令</h2><p>Jupyter 可以用以下指令啟動：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">jupyter notebook \</span><br><span class="line">  --no-browser \</span><br><span class="line">  --ip=0.0.0.0 \</span><br><span class="line">  --port=8080 \</span><br><span class="line">  --allow-root \</span><br><span class="line">  --NotebookApp.token=<span class="string">&#x27;&#x27;</span> \</span><br><span class="line">  --NotebookApp.password=<span class="string">&#x27;&#x27;</span></span><br></pre></td></tr></table></figure><p>幾個參數用途：</p><ul><li><code>--no-browser</code>：避免 server 嘗試打開瀏覽器。</li><li><code>--ip=0.0.0.0</code>：讓外部位址可以連到 Jupyter 服務。</li><li><code>--port</code>：指定 container 內服務 port。</li><li><code>--allow-root</code>：允許 root 身分執行 Jupyter。</li><li><code>--NotebookApp.token=&#39;&#39;</code> 與 <code>--NotebookApp.password=&#39;&#39;</code>：關閉 token 與 password 驗證，適合搭配受控環境或 tunnel 使用。</li></ul><p>TensorBoard 可以用以下指令啟動：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">tensorboard --logdir ./logs --host=0.0.0.0 --port=8081</span><br></pre></td></tr></table></figure><p>幾個參數用途：</p><ul><li><code>--logdir</code>：指定 log 路徑。</li><li><code>--host=0.0.0.0</code>：讓外部位址可以連到 TensorBoard。</li><li><code>--port</code>：指定服務 port。</li></ul><h2 id="Image-與-Container"><a href="#Image-與-Container" class="headerlink" title="Image 與 Container"></a>Image 與 Container</h2><p>Docker 裡最常遇到兩個概念：image 與 container。</p><ul><li>Image：環境模板，描述要安裝哪些套件、預設執行什麼指令。</li><li>Container：根據 image 開出來的執行實例，真正提供服務。</li></ul><p>Image 可以保留並重複使用；container 用完後通常可以刪掉，下次再用同一個 image 開新的 container。</p><h3 id="常用-Image-指令"><a href="#常用-Image-指令" class="headerlink" title="常用 Image 指令"></a>常用 Image 指令</h3><p>根據 Dockerfile 建立 image：</p><figure 
class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker build -t [image_name] [path]</span><br></pre></td></tr></table></figure><p>例如 Dockerfile 在目前資料夾，要建立 <code>jupyter_image</code>：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker build -t jupyter_image .</span><br></pre></td></tr></table></figure><p>如果想忽略 cache 重新 build：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker build -t [image_name] [path] --no-cache</span><br></pre></td></tr></table></figure><p>查看 image：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker images</span><br></pre></td></tr></table></figure><p>刪除 image：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker image <span class="built_in">rm</span> [image_name]</span><br></pre></td></tr></table></figure><h3 id="Dockerfile-範例"><a href="#Dockerfile-範例" class="headerlink" title="Dockerfile 範例"></a>Dockerfile 範例</h3><p>以下是一個 TensorBoard image 的 Dockerfile：</p><figure class="highlight dockerfile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">FROM</span> pytorch/pytorch:<span class="number">1.13</span>.<span class="number">0</span>-cuda11.<span class="number">6</span>-cudnn8-devel</span><br><span class="line"></span><br><span class="line"><span class="keyword">RUN</span><span class="language-bash"> apt-get update &amp;&amp;\</span></span><br><span class="line"><span class="language-bash">    apt-get -y upgrade &amp;&amp;\</span></span><br><span class="line"><span class="language-bash">    apt-get install -y git net-tools vim <span class="built_in">sudo</span> tcsh gcc g++ unzip python3 python3-pip &amp;&amp;\</span></span><br><span class="line"><span class="language-bash">    apt-get clean &amp;&amp;\</span></span><br><span class="line"><span class="language-bash">    <span class="built_in">rm</span> -rf /var/lib/apt/lists/*</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">RUN</span><span class="language-bash"> pip3 --no-cache-dir install torch \</span></span><br><span class="line"><span class="language-bash">    torchvision \</span></span><br><span class="line"><span class="language-bash">    torchaudio \</span></span><br><span class="line"><span class="language-bash">    tensorboard \</span></span><br><span class="line"><span class="language-bash">    jupyterlab \</span></span><br><span class="line"><span class="language-bash">    jupyter</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">CMD</span><span class="language-bash"> [<span class="string">&quot;tensorboard&quot;</span>, <span class="string">&quot;--logdir&quot;</span>, <span class="string">&quot;./logs&quot;</span>, <span class="string">&quot;--host=0.0.0.0&quot;</span>, <span class="string">&quot;--port=8324&quot;</span>]</span></span><br></pre></td></tr></table></figure><p>重點如下：</p><ul><li><code>FROM</code>：指定 base 
image。</li><li><code>RUN</code>：安裝套件或執行建置指令。</li><li><code>CMD</code>：container 啟動後預設執行的 command。</li></ul><h3 id="常用-Container-指令"><a href="#常用-Container-指令" class="headerlink" title="常用 Container 指令"></a>常用 Container 指令</h3><p>建好 image 後，就可以用 <code>docker run</code> 建立並執行 container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run [options] [image_name] [<span class="built_in">command</span>]</span><br></pre></td></tr></table></figure><p>常見 options：</p><ul><li><code>-it</code>：開啟互動式 terminal，常搭配 <code>bash</code> 使用。</li><li><code>--name</code>：指定 container 名稱，建議加上，方便管理。</li><li><code>-p</code>：做 port mapping，例如 <code>8080:8080</code>。</li><li><code>-v</code>：mount 本機資料夾到 container 內。</li><li><code>--gpus all</code>：讓 container 使用 GPU。</li><li><code>command</code>：覆蓋 image 中的預設 <code>CMD</code>。</li></ul><p>只執行 <code>jupyter_image</code>，並命名為 <code>jupyter_container</code>：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run --name jupyter_container jupyter_image</span><br></pre></td></tr></table></figure><p>如果想開 terminal、轉 port、掛載目前資料夾、並使用 GPU：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run -it --name jupyter_container -p 8080:8080 -v ./:/workspace --gpus all jupyter_image bash</span><br></pre></td></tr></table></figure><p>離開 container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">exit</span></span><br></pre></td></tr></table></figure><p>重新啟動並 attach：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td 
class="code"><pre><span class="line">docker start [container_name]</span><br><span class="line">docker attach [container_name]</span><br></pre></td></tr></table></figure><p>查看 container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">docker ps</span><br><span class="line">docker ps -a</span><br></pre></td></tr></table></figure><p>刪除 container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker <span class="built_in">rm</span> [CONTAINER ID]</span><br></pre></td></tr></table></figure><h2 id="Docker-Compose"><a href="#Docker-Compose" class="headerlink" title="Docker Compose"></a>Docker Compose</h2><p>如果一次要管理多個服務，例如 Jupyter 與 TensorBoard，就適合使用 Docker Compose。它可以用一份 <code>docker-compose.yml</code> 管理多個 image、container、port 與 volume。</p><p>啟動服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose up</span><br></pre></td></tr></table></figure><p>讓服務在背景執行：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose up -d</span><br></pre></td></tr></table></figure><p>停用服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose down</span><br></pre></td></tr></table></figure><p>清掉 orphan container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose down --remove-orphans</span><br></pre></td></tr></table></figure><h3 id="docker-compose-yml-範例"><a href="#docker-compose-yml-範例" class="headerlink" 
title="docker-compose.yml 範例"></a>docker-compose.yml 範例</h3><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">version:</span> <span class="string">&quot;3&quot;</span></span><br><span class="line"><span class="attr">services:</span></span><br><span class="line">  <span class="attr">Jupyter:</span></span><br><span class="line">    <span class="attr">build:</span> <span class="string">./jupyter</span></span><br><span class="line">    <span class="attr">image:</span> <span class="string">docker/threemonth</span></span><br><span class="line">    <span class="attr">container_name:</span> <span class="string">jupyterthreemonth</span></span><br><span class="line">    <span class="attr">ports:</span> </span><br><span class="line">    <span class="bullet">-</span> <span class="string">&quot;8323:8323&quot;</span></span><br><span class="line">    <span class="attr">volumes:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">./:/workspace</span> </span><br><span class="line">    <span class="attr">restart:</span> <span class="string">unless-stopped</span></span><br><span class="line">    <span 
class="attr">command:</span> <span class="string">jupyter</span> <span class="string">notebook</span> <span class="string">--no-browser</span> <span class="string">--ip=0.0.0.0</span> <span class="string">--port=8323</span> <span class="string">--allow-root</span> <span class="string">--NotebookApp.token=&#x27;&#x27;</span> <span class="string">--NotebookApp.password=&#x27;&#x27;</span></span><br><span class="line"></span><br><span class="line">  <span class="attr">Tensorboard:</span></span><br><span class="line">    <span class="attr">build:</span> <span class="string">./tensorboard</span></span><br><span class="line">    <span class="attr">image:</span> <span class="string">docker/threemonth2</span></span><br><span class="line">    <span class="attr">container_name:</span> <span class="string">tensorboardthreemonth</span></span><br><span class="line">    <span class="attr">ports:</span> </span><br><span class="line">    <span class="bullet">-</span> <span class="string">&quot;8324:8324&quot;</span></span><br><span class="line">    <span class="attr">depends_on:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">Jupyter</span></span><br><span class="line">    <span class="attr">volumes:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">./:/workspace</span> </span><br><span class="line">    <span class="attr">restart:</span> <span class="string">unless-stopped</span></span><br></pre></td></tr></table></figure><p>幾個欄位對應到 <code>docker run</code> 的概念：</p><ul><li><code>build</code>：指定 Dockerfile 所在目錄。</li><li><code>image</code>：image 名稱。</li><li><code>container_name</code>：container 名稱。</li><li><code>ports</code>：對應 <code>docker run -p</code>。</li><li><code>volumes</code>：對應 <code>docker run -v</code>。</li><li><code>restart: unless-stopped</code>：除非手動停止，否則 container 掛掉後會自動重啟。</li><li><code>depends_on</code>：控制服務啟動順序。</li><li><code>command</code>：覆蓋 Dockerfile 中的預設 
<code>CMD</code>。</li></ul><h2 id="小結"><a href="#小結" class="headerlink" title="小結"></a>小結</h2><p>Docker Compose 很適合用來管理多個彼此相關的開發服務。這個範例把 Jupyter 與 TensorBoard 分成兩個 container，再透過 volume 共用工作目錄，最後用 local forwarding 讓本機可以安全地連到遠端服務。</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ul><li><a href="https://yeasy.gitbook.io/docker_practice/">Docker 從入門到實踐</a></li><li><a href="https://johnliu55.tw/ssh-tunnel.html">SSH Tunnel 筆記</a></li><li><a href="https://azole.medium.com/docker-container-%E5%9F%BA%E7%A4%8E%E5%85%A5%E9%96%80%E7%AF%87-2-c14d8f852ae4">Docker Container 基礎入門篇</a></li></ul>]]>
    </content>
    <id>https://www.threemonth03.com/2023/07/21/2023-07-21-%E9%80%8F%E9%81%8Edocker%E5%BB%BA%E7%AB%8Bjupyter%E8%88%87tensorboard%E7%92%B0%E5%A2%83/</id>
    <link href="https://www.threemonth03.com/2023/07/21/2023-07-21-%E9%80%8F%E9%81%8Edocker%E5%BB%BA%E7%AB%8Bjupyter%E8%88%87tensorboard%E7%92%B0%E5%A2%83/"/>
    <published>2023-07-20T19:18:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理如何用 Docker Compose 建立 Jupyter 與 TensorB]]>
    </summary>
    <title>透過docker建立jupyter與tensorboard環境</title>
    <updated>2026-05-10T14:12:45.329Z</updated>
  </entry>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="深度學習" scheme="https://www.threemonth03.com/categories/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92/"/>
    <category term="Transformer" scheme="https://www.threemonth03.com/tags/Transformer/"/>
    <category term="LLaMA" scheme="https://www.threemonth03.com/tags/LLaMA/"/>
    <category term="RoPE" scheme="https://www.threemonth03.com/tags/RoPE/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理 LLaMA 使用的 Positional Embedding，也就是 RoPE（Rotary Position Embedding）。Positional Encoding 是 Transformer 中很容易被略過、但其實很關鍵的細節：它讓模型能在沒有 RNN 時間序列結構的情況下，仍然保留 token 的順序資訊。</p><h2 id="Positional-Encoding-簡介"><a href="#Positional-Encoding-簡介" class="headerlink" title="Positional Encoding 簡介"></a>Positional Encoding 簡介</h2><p>Transformer 會一次讀入整段 input，不像 RNN 會依照時間序列逐步傳入。因此，input embedding 必須額外加入位置資訊。</p><p>舉例來說：</p><p align="center"><strong><em>I Am that I Am.</em></strong></p><p>如果用 RNN 處理，token 會按照順序被送進模型；但 Transformer 一次看到整句話，如果沒有位置資訊，就很難分辨不同位置的 <code>I</code> 或 <code>Am</code> 在關係計算中的差異。</p><p>Positional Encoding 常見做法可以分成三類：</p><ol><li>絕對位置編碼：最直覺的做法是直接對 input embedding 加上 index。不過 index 很大時，可能影響原本 embedding 的語意資訊。</li><li>相對位置編碼：直接建模 token 之間的相對距離，例如 <code>Self-Attention with Relative Position Representations</code>，可以減少部分 weight matrix 的運算。</li><li>融合式：表面上使用絕對位置編碼，但經過 attention 內積後，結果會呈現相對位置關係。常見例子包含 <code>Attention Is All You Need</code> 的三角函數位置編碼，以及 RoPE。</li></ol><p>RoPE 的特色是使用複數旋轉來編碼位置，而 LLaMA 採用的正是這個方法。</p><h2 id="LLaMA使用的Positional-Embedding"><a href="#LLaMA使用的Positional-Embedding" class="headerlink" title="LLaMA使用的Positional Embedding"></a>LLaMA使用的Positional Embedding</h2><p>LLaMA 使用的 Positional Embedding 是 RoPE。它可以視為一種融合絕對位置與相對位置資訊的方法；如果硬要分類，會比較接近「用絕對位置編碼達成相對位置效果」。</p><p>RoPE 的核心形式如下：</p><img src="https://i.imgur.com/dujsQsd.png" alt="RoPE演算法" title="RoPE演算法">  <p>兩個複數做內積時，可以理解成將其中一個複數取共軛後相乘，再取實部。複數與共軛複數相乘時，指數部分會變成相減，因此 RoPE 可以把絕對位置放進歐拉表示中，再透過 attention 內積留下相對位置資訊。</p><h2 id="RoPE-Code-分析"><a href="#RoPE-Code-分析" class="headerlink" title="RoPE Code 分析"></a>RoPE Code 分析</h2><p>以下程式碼來自 <a href="https://github.com/facebookresearch/llama/blob/main/llama/model.py#L56"><code>facebookresearch/llama</code></a>。RoPE 最重要的三個 function 是 <code>precompute_freqs_cis</code>、<code>reshape_for_broadcast</code> 與 <code>apply_rotary_emb</code>。</p><figure class="highlight 
python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#......</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">precompute_freqs_cis</span>(<span class="params">dim: <span class="built_in">int</span>, end: <span class="built_in">int</span>, theta: <span class="built_in">float</span> = <span class="number">10000.0</span></span>):</span><br><span class="line">    freqs = <span class="number">1.0</span> / (theta ** (torch.arange(<span class="number">0</span>, dim, <span class="number">2</span>)[: (dim // <span class="number">2</span>)].<span class="built_in">float</span>() / dim))</span><br><span class="line">    t = torch.arange(end, device=freqs.device)  <span class="comment"># type: ignore</span></span><br><span class="line">    freqs = torch.outer(t, freqs).<span class="built_in">float</span>()  <span class="comment"># type: ignore</span></span><br><span class="line">    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  <span class="comment"># 
complex64</span></span><br><span class="line">    <span class="keyword">return</span> freqs_cis</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">reshape_for_broadcast</span>(<span class="params">freqs_cis: torch.Tensor, x: torch.Tensor</span>):</span><br><span class="line">    ndim = x.ndim</span><br><span class="line">    <span class="keyword">assert</span> <span class="number">0</span> &lt;= <span class="number">1</span> &lt; ndim</span><br><span class="line">    <span class="keyword">assert</span> freqs_cis.shape == (x.shape[<span class="number">1</span>], x.shape[-<span class="number">1</span>])</span><br><span class="line">    shape = [d <span class="keyword">if</span> i == <span class="number">1</span> <span class="keyword">or</span> i == ndim - <span class="number">1</span> <span class="keyword">else</span> <span class="number">1</span> <span class="keyword">for</span> i, d <span class="keyword">in</span> <span class="built_in">enumerate</span>(x.shape)]</span><br><span class="line">    <span class="keyword">return</span> freqs_cis.view(*shape)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">apply_rotary_emb</span>(<span class="params"></span></span><br><span class="line"><span class="params">    xq: torch.Tensor,</span></span><br><span class="line"><span class="params">    xk: torch.Tensor,</span></span><br><span class="line"><span class="params">    freqs_cis: torch.Tensor,</span></span><br><span class="line"><span class="params"></span>) -&gt; <span class="type">Tuple</span>[torch.Tensor, torch.Tensor]:</span><br><span class="line">    xq_ = torch.view_as_complex(xq.<span class="built_in">float</span>().reshape(*xq.shape[:-<span class="number">1</span>], -<span class="number">1</span>, <span class="number">2</span>))</span><br><span class="line">    
xk_ = torch.view_as_complex(xk.<span class="built_in">float</span>().reshape(*xk.shape[:-<span class="number">1</span>], -<span class="number">1</span>, <span class="number">2</span>))</span><br><span class="line">    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)</span><br><span class="line">    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(<span class="number">3</span>)</span><br><span class="line">    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(<span class="number">3</span>)</span><br><span class="line">    <span class="keyword">return</span> xq_out.type_as(xq), xk_out.type_as(xk)</span><br><span class="line"><span class="comment">#......</span></span><br></pre></td></tr></table></figure><p><code>precompute_freqs_cis</code> 會先計算每個位置的歐拉表示，也就是 $e^{im\theta}$ 與 $e^{in\theta}$。</p><p><code>apply_rotary_emb</code> 則會計算 $q_m e^{im\theta}$ 與 $k_n e^{in\theta}$，再把結果轉成實數表示，例如 $[q_m \cos(m\theta), q_m \sin(m\theta)]$ 與 $[k_n \cos(n\theta), k_n \sin(n\theta)]$。後續 attention 內積就會直接使用這些帶有旋轉位置資訊的 Q、K。</p><p><code>reshape_for_broadcast</code> 的用途是把 <code>xq_</code>、<code>xk_</code> 與 <code>freqs_cis</code> 調整成可以 broadcast 的形狀，讓矩陣能逐元素相乘。</p><h2 id="小結"><a href="#小結" class="headerlink" title="小結"></a>小結</h2><p>RoPE 有趣的地方在於，它把位置資訊藏在複數旋轉裡，最後透過 attention 的內積自然留下相對位置關係。這種設計把訊號處理與深度學習結合得很漂亮，也讓 Positional Encoding 不只是「加上一個位置向量」這麼簡單。</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ul><li><a href="https://blog.csdn.net/weixin_44826203/article/details/129255185">CSDN RoPE 筆記</a></li><li><a href="https://cloud.tencent.com/developer/article/2196111">騰訊雲 RoPE 介紹</a></li><li><a href="https://github.com/facebookresearch/llama/blob/main/llama/model.py#L56">facebookresearch&#x2F;llama model.py</a></li><li><a href="https://zhuanlan.zhihu.com/p/398457641">RoPE 相關整理</a></li><li><a href="https://arxiv.org/abs/2104.09864v4">RoFormer: Enhanced Transformer with Rotary Position Embedding</a></li><li><a 
href="https://kexue.fm/archives/8130">Transformer 升級之路：RoPE</a></li></ul>]]>
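補充一段簡化的 NumPy 驗證（示意用，非 LLaMA 原始碼）：把二維向量視為複數、乘上 $e^{im\theta}$ 做旋轉後再取內積，可以驗證結果只取決於相對位置 $m-n$，這正是 RoPE「用絕對位置編碼達成相對位置效果」的核心性質。

```python
import numpy as np

# 簡化示意:假設 head 維度為 2,把 (x0, x1) 視為複數 x0 + i*x1
def rotate(x, pos, theta=0.1):
    # 對位置 pos 的向量做 RoPE 旋轉,等同於乘上 e^{i * pos * theta}
    z = (x[0] + 1j * x[1]) * np.exp(1j * pos * theta)
    return np.array([z.real, z.imag])

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# 絕對位置 (m, n) = (7, 3) 與 (12, 8) 的相對距離都是 4
score_a = rotate(q, 7) @ rotate(k, 3)
score_b = rotate(q, 12) @ rotate(k, 8)

# 兩個內積相等:結果只取決於相對位置 m - n
assert np.isclose(score_a, score_b)
```

這裡的 `theta` 對應 `precompute_freqs_cis` 中每個頻率分量的角速度；實際 RoPE 會對每一對維度使用不同頻率，但相對位置性質的論證完全相同。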
    </content>
    <id>https://www.threemonth03.com/2023/07/20/2023-07-20-LLaMA%E4%B8%AD%E4%BD%BF%E7%94%A8%E7%9A%84Positional%20Embedding/</id>
    <link href="https://www.threemonth03.com/2023/07/20/2023-07-20-LLaMA%E4%B8%AD%E4%BD%BF%E7%94%A8%E7%9A%84Positional%20Embedding/"/>
    <published>2023-07-20T00:25:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理 LLaMA 使用的 Positional Embedding，也就是 RoP]]>
    </summary>
    <title>LLaMA中使用的Positional Embedding</title>
    <updated>2026-05-10T14:10:22.263Z</updated>
  </entry>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="深度學習" scheme="https://www.threemonth03.com/categories/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92/"/>
    <category term="PyTorch" scheme="https://www.threemonth03.com/tags/PyTorch/"/>
    <category term="Transformer" scheme="https://www.threemonth03.com/tags/Transformer/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理 <code>torch.nn.Transformer</code> 中一個常見限制：<code>d_model</code> 必須能被 <code>nhead</code> 整除。</p><p>從理論上看，Multi-Head Attention 可以想成對同一份 input 做多組 Self-Attention，再把多個 head 的輸出接起來。因此直覺上會覺得 feature size 與 head 數量不一定要整除。但 PyTorch 的實作為了效率，會把 embedding dimension 平均切給每個 head，這就是限制的來源。</p><h2 id="Multi-Head-Transformer-理論"><a href="#Multi-Head-Transformer-理論" class="headerlink" title="Multi-Head Transformer 理論"></a>Multi-Head Transformer 理論</h2><p>Multi-Head Transformer 的概念是：對同一個 input 做多個 Self-Attention，將多次輸出 concat 後，再透過一個矩陣投影回原本大小。具體流程如下：</p><img src="https://pic1.zhimg.com/80/v2-6bdaf739fd6b827b2087b4e151c560f4_720w.webp" alt="Multi-Head Transformer的輸出" title="Multi-Head Transformer的輸出"><img src="https://pic4.zhimg.com/v2-35d78d9aa9150ae4babd0ea6aa68d113_r.jpg" alt="將多個輸出壓回一個輸出的過程" title="將多個輸出壓回一個輸出的過程"><p>在這個抽象描述裡，輸入 <code>x</code> 的 feature size 和 head 數量看起來沒有硬性關係。也就是說，如果只看理論流程，不論 feature size 與 head 數量是多少，似乎都可以訓練。</p><h2 id="為什麼-d-model-需要被-nhead-整除"><a href="#為什麼-d-model-需要被-nhead-整除" class="headerlink" title="為什麼 d_model 需要被 nhead 整除"></a>為什麼 <code>d_model</code> 需要被 <code>nhead</code> 整除</h2><p>如果對 <code>nn.Transformer</code> 填入任意的 feature size 與 head 數量，可能會遇到錯誤訊息，提示 <code>embed_dim</code> 必須能被 <code>num_heads</code> 整除。</p><img src="https://i.imgur.com/H2UvQ1E.png" alt="nn.transformer的報錯範例" title="nn.transformer的報錯範例"><p>原因可以從 PyTorch source code 看出來。首先看 <a href="https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html#Transformer"><code>nn.Transformer</code></a>，其中與 <code>nhead</code>、<code>d_model</code> 相關的部分會進到 <code>TransformerEncoderLayer</code>。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Transformer</span>(<span class="title class_ inherited__">Module</span>):</span><br><span class="line">    <span class="comment">#......</span></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, d_model: <span class="built_in">int</span> = <span class="number">512</span>, nhead: <span class="built_in">int</span> = <span class="number">8</span>, num_encoder_layers: <span class="built_in">int</span> = <span class="number">6</span>,</span></span><br><span class="line"><span class="params">                 num_decoder_layers: <span class="built_in">int</span> = <span class="number">6</span>, dim_feedforward: <span class="built_in">int</span> = <span class="number">2048</span>, dropout: <span class="built_in">float</span> = <span class="number">0.1</span>,</span></span><br><span class="line"><span class="params">                 activation: <span class="type">Union</span>[<span class="built_in">str</span>, <span class="type">Callable</span>[[Tensor], Tensor]] = F.relu,</span></span><br><span class="line"><span class="params">                 custom_encoder: <span class="type">Optional</span>[<span class="type">Any</span>] = <span class="literal">None</span>, custom_decoder: <span class="type">Optional</span>[<span class="type">Any</span>] = <span class="literal">None</span>,</span></span><br><span 
class="line"><span class="params">                 layer_norm_eps: <span class="built_in">float</span> = <span class="number">1e-5</span>, batch_first: <span class="built_in">bool</span> = <span class="literal">False</span>, norm_first: <span class="built_in">bool</span> = <span class="literal">False</span>,</span></span><br><span class="line"><span class="params">                 device=<span class="literal">None</span>, dtype=<span class="literal">None</span></span>) -&gt; <span class="literal">None</span>:</span><br><span class="line">        factory_kwargs = &#123;<span class="string">&#x27;device&#x27;</span>: device, <span class="string">&#x27;dtype&#x27;</span>: dtype&#125;</span><br><span class="line">        <span class="built_in">super</span>().__init__()</span><br><span class="line">        torch._C._log_api_usage_once(<span class="string">f&quot;torch.nn.modules.<span class="subst">&#123;self.__class__.__name__&#125;</span>&quot;</span>)</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> custom_encoder <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span>:</span><br><span class="line">            <span class="variable language_">self</span>.encoder = custom_encoder</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout,</span><br><span class="line">                                                    activation, layer_norm_eps, batch_first, norm_first,</span><br><span class="line">                                                    **factory_kwargs)</span><br><span class="line">            encoder_norm = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)</span><br><span class="line">            <span class="variable language_">self</span>.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, 
encoder_norm)</span><br><span class="line">        <span class="comment">### ......</span></span><br></pre></td></tr></table></figure><p>接著看 <code>TransformerEncoderLayer</code>，可以發現真正處理 attention 的類別是 <code>MultiheadAttention</code>。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">TransformerEncoderLayer</span>(<span class="title class_ inherited__">Module</span>):</span><br><span class="line">    <span class="comment">#......</span></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, d_model: <span class="built_in">int</span>, nhead: <span class="built_in">int</span>, dim_feedforward: <span class="built_in">int</span> = <span class="number">2048</span>, dropout: <span class="built_in">float</span> = <span class="number">0.1</span>,</span></span><br><span class="line"><span class="params">                 activation: <span class="type">Union</span>[<span class="built_in">str</span>, <span class="type">Callable</span>[[Tensor], Tensor]] = F.relu,</span></span><br><span class="line"><span class="params">                 layer_norm_eps: <span class="built_in">float</span> = <span class="number">1e-5</span>, batch_first: <span class="built_in">bool</span> = <span class="literal">False</span>, norm_first: <span class="built_in">bool</span> = <span class="literal">False</span>,</span></span><br><span class="line"><span class="params">                 device=<span 
class="literal">None</span>, dtype=<span class="literal">None</span></span>) -&gt; <span class="literal">None</span>:</span><br><span class="line">        factory_kwargs = &#123;<span class="string">&#x27;device&#x27;</span>: device, <span class="string">&#x27;dtype&#x27;</span>: dtype&#125;</span><br><span class="line">        <span class="built_in">super</span>().__init__()</span><br><span class="line">        <span class="variable language_">self</span>.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,</span><br><span class="line">                                            **factory_kwargs)</span><br><span class="line">        <span class="comment">#......</span></span><br><span class="line">    <span class="comment">#......</span></span><br></pre></td></tr></table></figure><p>繼續追 <a href="https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention"><code>MultiheadAttention</code></a>，會看到它把 forward 的核心邏輯交給 <code>F.multi_head_attention_forward</code>。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span 
class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MultiheadAttention</span>(<span class="title class_ inherited__">Module</span>):</span><br><span class="line">    <span class="comment">#.......</span></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">forward</span>(<span class="params"></span></span><br><span class="line"><span class="params">            self,</span></span><br><span class="line"><span class="params">            query: Tensor,</span></span><br><span class="line"><span class="params">            key: Tensor,</span></span><br><span class="line"><span class="params">            value: Tensor,</span></span><br><span class="line"><span class="params">            key_padding_mask: <span class="type">Optional</span>[Tensor] = <span class="literal">None</span>,</span></span><br><span class="line"><span class="params">            need_weights: <span class="built_in">bool</span> = <span class="literal">True</span>,</span></span><br><span class="line"><span class="params">            attn_mask: <span class="type">Optional</span>[Tensor] = <span class="literal">None</span>,</span></span><br><span class="line"><span class="params">            average_attn_weights: <span class="built_in">bool</span> = <span class="literal">True</span>,</span></span><br><span class="line"><span class="params">            is_causal : <span 
class="built_in">bool</span> = <span class="literal">False</span></span>) -&gt; <span class="type">Tuple</span>[Tensor, <span class="type">Optional</span>[Tensor]]:</span><br><span class="line">        <span class="comment">#.......</span></span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> <span class="variable language_">self</span>._qkv_same_embed_dim:</span><br><span class="line">            attn_output, attn_output_weights = F.multi_head_attention_forward(</span><br><span class="line">                query, key, value, <span class="variable language_">self</span>.embed_dim, <span class="variable language_">self</span>.num_heads,</span><br><span class="line">                <span class="variable language_">self</span>.in_proj_weight, <span class="variable language_">self</span>.in_proj_bias,</span><br><span class="line">                <span class="variable language_">self</span>.bias_k, <span class="variable language_">self</span>.bias_v, <span class="variable language_">self</span>.add_zero_attn,</span><br><span class="line">                <span class="variable language_">self</span>.dropout, <span class="variable language_">self</span>.out_proj.weight, <span class="variable language_">self</span>.out_proj.bias,</span><br><span class="line">                training=<span class="variable language_">self</span>.training,</span><br><span class="line">                key_padding_mask=key_padding_mask, need_weights=need_weights,</span><br><span class="line">                attn_mask=attn_mask,</span><br><span class="line">                use_separate_proj_weight=<span class="literal">True</span>,</span><br><span class="line">                q_proj_weight=<span class="variable language_">self</span>.q_proj_weight, k_proj_weight=<span class="variable language_">self</span>.k_proj_weight,</span><br><span class="line">                v_proj_weight=<span class="variable 
language_">self</span>.v_proj_weight,</span><br><span class="line">                average_attn_weights=average_attn_weights,</span><br><span class="line">                is_causal=is_causal)</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            attn_output, attn_output_weights = F.multi_head_attention_forward(</span><br><span class="line">                query, key, value, <span class="variable language_">self</span>.embed_dim, <span class="variable language_">self</span>.num_heads,</span><br><span class="line">                <span class="variable language_">self</span>.in_proj_weight, <span class="variable language_">self</span>.in_proj_bias,</span><br><span class="line">                <span class="variable language_">self</span>.bias_k, <span class="variable language_">self</span>.bias_v, <span class="variable language_">self</span>.add_zero_attn,</span><br><span class="line">                <span class="variable language_">self</span>.dropout, <span class="variable language_">self</span>.out_proj.weight, <span class="variable language_">self</span>.out_proj.bias,</span><br><span class="line">                training=<span class="variable language_">self</span>.training,</span><br><span class="line">                key_padding_mask=key_padding_mask,</span><br><span class="line">                need_weights=need_weights,</span><br><span class="line">                attn_mask=attn_mask,</span><br><span class="line">                average_attn_weights=average_attn_weights,</span><br><span class="line">                is_causal=is_causal)</span><br><span class="line">        <span class="keyword">if</span> <span class="variable language_">self</span>.batch_first <span class="keyword">and</span> is_batched:</span><br><span class="line">            <span class="keyword">return</span> attn_output.transpose(<span class="number">1</span>, <span class="number">0</span>), attn_output_weights</span><br><span 
class="line">        <span class="keyword">else</span>:</span><br><span class="line">            <span class="keyword">return</span> attn_output, attn_output_weights</span><br><span class="line">    <span class="comment">#......</span></span><br></pre></td></tr></table></figure><p>最後看 <a href="https://github.com/pytorch/pytorch/blob/main/torch/nn/functional.py"><code>F.multi_head_attention_forward</code></a>。關鍵在這段：PyTorch 會先用 <code>embed_dim // num_heads</code> 算出每個 head 分到的維度，並 assert 這個拆分必須剛好整除。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">multi_head_attention_forward</span>(<span class="params"></span></span><br><span class="line"><span class="params">    query: Tensor,</span></span><br><span class="line"><span class="params">    key: Tensor,</span></span><br><span class="line"><span class="params">    value: Tensor,</span></span><br><span class="line"><span class="params">    embed_dim_to_check: <span class="built_in">int</span>,</span></span><br><span class="line"><span class="params">    num_heads: <span 
class="built_in">int</span>,</span></span><br><span class="line"><span class="params">    <span class="comment">#......</span></span></span><br><span class="line"><span class="params">    </span>)</span><br><span class="line">    <span class="comment">#......</span></span><br><span class="line">    <span class="keyword">else</span>:</span><br><span class="line">        head_dim = embed_dim // num_heads</span><br><span class="line">    <span class="keyword">assert</span> head_dim * num_heads == embed_dim, <span class="string">f&quot;embed_dim <span class="subst">&#123;embed_dim&#125;</span> not divisible by num_heads <span class="subst">&#123;num_heads&#125;</span>&quot;</span></span><br><span class="line">    <span class="comment">#......</span></span><br><span class="line">        q = q.view(bsz, num_heads, tgt_len, head_dim)</span><br><span class="line">        k = k.view(bsz, num_heads, src_len, head_dim)</span><br><span class="line">        v = v.view(bsz, num_heads, src_len, head_dim)</span><br><span class="line"></span><br><span class="line">        attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)</span><br><span class="line">        attn_output = attn_output.permute(<span class="number">2</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">3</span>).contiguous().view(bsz * tgt_len, embed_dim)</span><br><span class="line"></span><br><span class="line">        attn_output = linear(attn_output, out_proj_weight, out_proj_bias)</span><br><span class="line">        attn_output = attn_output.view(tgt_len, bsz, attn_output.size(<span class="number">1</span>))</span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> is_batched:</span><br><span class="line">            <span class="comment"># squeeze the output if input was unbatched</span></span><br><span class="line">            attn_output = attn_output.squeeze(<span 
class="number">1</span>)</span><br><span class="line">        <span class="keyword">return</span> attn_output, <span class="literal">None</span></span><br></pre></td></tr></table></figure><p>也就是說，PyTorch 的實作會把 Q、K、V reshape 成 4 維：</p><ul><li><code>batch size</code></li><li><code>number of head</code></li><li><code>target/source length</code></li><li><code>head dimension</code></li></ul><p>原本的 feature size 會被拆成 <code>number of head * head dimension</code>。如果 <code>embed_dim</code> 不能被 <code>num_heads</code> 整除，就無法平均 reshape，因此會直接 assert。</p><p>簡單來說，<code>torch.nn.Transformer</code> 的 multi-head 實作不是把完整 input 重複餵給每個 head，而是將 feature size 平均拆成多份，每份交給不同 head 計算，最後再接回來。這種做法可以大幅節省運算量與記憶體使用，但代價就是 <code>d_model</code> 必須被 <code>nhead</code> 整除。</p><h2 id="小結"><a href="#小結" class="headerlink" title="小結"></a>小結</h2><p>這個例子展示了理論描述與工程實作之間的差異。理論上 Multi-Head Attention 可以用比較抽象的方式理解，但在框架實作中，為了讓張量 reshape、batch 運算與 GPU 加速更有效率，會加入更明確的維度限制。</p><p>這也是讀 framework source code 很有價值的地方：除了理解模型，也能看到實作用哪些假設換取效能與穩定性。</p>]]>
    </content>
    <id>https://www.threemonth03.com/2023/07/14/2023-07-14-torch.nn.transformer%E7%9A%84%E7%B4%B0%E7%AF%80/</id>
    <link href="https://www.threemonth03.com/2023/07/14/2023-07-14-torch.nn.transformer%E7%9A%84%E7%B4%B0%E7%AF%80/"/>
    <published>2023-07-14T09:29:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理 <code>torch.nn.Transformer</code> 中一個常]]>
    </summary>
    <title>為什麼torch.nn.transformer中每個input的feature size需要是head數量的倍數</title>
    <updated>2026-05-10T14:09:36.003Z</updated>
  </entry>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="論文筆記" scheme="https://www.threemonth03.com/categories/%E8%AB%96%E6%96%87%E7%AD%86%E8%A8%98/"/>
    <category term="DNN" scheme="https://www.threemonth03.com/tags/DNN/"/>
    <category term="Computer Architecture" scheme="https://www.threemonth03.com/tags/Computer-Architecture/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="Introduction-to-Hardware"><a href="#Introduction-to-Hardware" class="headerlink" title="Introduction to Hardware"></a>Introduction to Hardware</h2><p>DNN 常見的運算包含 Convolution 與 FC Layer，這兩類 Layer 的核心多半是大量點積運算，而點積又會仰賴 MAC（Multiply-Accumulate，乘法累加器）。在硬體裡，MAC 通常會放在 ALU 中，因此一塊運算單元會配置不少 ALU 來支撐高吞吐量。</p><p>這篇 survey 將 DNN 運算硬體分成兩大類：</p><ul><li>Temporal Architecture：代表是 CPU、GPU。</li><li>Spatial Architecture：代表是 ASIC、FPGA 這類 DNN accelerator。</li></ul><img src="https://i.imgur.com/4xaAk1x.png" alt="Temporal Architecture and Spatial Architecture" title="Temporal Architecture and Spatial Architecture"><h3 id="Temporal-Architecture-and-Spatial-Architecture"><a href="#Temporal-Architecture-and-Spatial-Architecture" class="headerlink" title="Temporal Architecture and Spatial Architecture"></a>Temporal Architecture and Spatial Architecture</h3><p>Temporal Architecture 主要包含 CPU 與 GPU。這類架構通常透過 SIMD、SIMT 與演算法最佳化來縮短運算時間。</p><p>Spatial Architecture 則比較接近加速器的設計思路，常見實作是 ASIC 或 FPGA。它會將 Register File 與 Control Logic 放在 ALU 附近，藉此減少資料搬移，降低能耗。</p><h2 id="Temporal-Architecture"><a href="#Temporal-Architecture" class="headerlink" title="Temporal Architecture"></a>Temporal Architecture</h2><p>Temporal Architecture 的重點比較偏向「如何把 DNN 運算轉成適合 CPU&#x2F;GPU 執行的形式」。</p><h3 id="FC-Layer-and-Convolutional-Layer"><a href="#FC-Layer-and-Convolutional-Layer" class="headerlink" title="FC Layer and Convolutional Layer"></a>FC Layer and Convolutional Layer</h3><p>FC Layer 本質上就是 2 維 Input 與 Filter 的內積。如果 Input 原本不是 2 維，例如包含 Channel 或 Batch Size，就會先 flatten 成 2 維矩陣，其中一維通常是 Batch Size。轉成矩陣乘法後，CPU&#x2F;GPU 就可以用 SIMD 或 SIMT 來加速。</p><img src="https://i.imgur.com/MQJIFM5.png" alt="FC Layer Operation" title="FC Layer Operation"><p>Convolutional Layer 也可以用類似方式處理：把 convolution kernel 攤成 1 維陣列，再把 Input 攤成對應的 2 維陣列，最後轉成矩陣內積。缺點是轉換後的 Input feature map 會變大，尤其在多個 Filter 同時計算時，Input feature map 會被重複展開。</p><img src="https://i.imgur.com/qv6b0rV.png" alt="Convolutional Layer Operation" 
title="Convolutional Layer Operation"><h3 id="Convolution-Optimization"><a href="#Convolution-Optimization" class="headerlink" title="Convolution Optimization"></a>Convolution Optimization</h3><p>作者提到兩種常見的矩陣內積最佳化方法：</p><ul><li>FFT：適合把 convolution 轉到頻域計算。</li><li>Strassen Algorithm：適合加速矩陣乘法。</li></ul><p>FFT 的複雜度可從 <code>O(No^2 * Nf^2)</code> 降到 <code>O(No^2 * log No)</code>，其中 <code>Nf</code> 是 Filter 的長寬，<code>No</code> 是 Output 的長寬。不過 <code>Nf^2</code> 不一定比 <code>log No</code> 大，因此作者建議 Filter 較大時，例如 <code>filter &gt; 5 * 5</code>，再考慮 FFT。</p><p>Strassen Algorithm 則是把矩陣乘法從 <code>O(N^3)</code> 降到 <code>O(N^(log7/log2))</code>。由於 convolution 轉矩陣乘法時，Filter 越大，展開後的 Input 越容易膨脹，所以 Strassen 比較適合 Filter 較小的情境，例如 <code>filter &lt; 3 * 3</code>。</p><h2 id="Spatial-Architecture"><a href="#Spatial-Architecture" class="headerlink" title="Spatial Architecture"></a>Spatial Architecture</h2><p>MAC 運作時會從 DRAM 讀取資料、進行乘法與加法，最後再把結果寫回 DRAM。從能耗比較可以看到，DRAM 存取的成本是 ALU 運算的上百倍。</p><p>因此 Spatial Architecture 的目標是減少資料搬移。它會把 Register File 與 Control Logic 放在 ALU 旁邊，組成 Processing Engine（PE），讓資料能在更靠近運算單元的位置被重複使用。</p><img src="https://i.imgur.com/1MPRJW0.png" alt="Energy Consumption of Component" title="Energy Consumption of Component"><h3 id="Data-Reuse"><a href="#Data-Reuse" class="headerlink" title="Data Reuse"></a>Data Reuse</h3><p>PE 中的 Register File 離 ALU 很近，因此設計重點會變成如何提高 Data Reuse。作者將 Data Reuse 分成三種：</p><ul><li>Convolutional Reuse：Convolution 運算中，Filter 與 Input 都會被重複使用。</li><li>Fmap Reuse：一筆 Feature Map 資料可能被多個 Filter 使用。</li><li>Filter Reuse：多筆資料可能共用同一個 Filter，常見於 <code>Batch Size &gt; 1</code>。</li></ul><img src="https://i.imgur.com/1Yww7TW.png" alt="Data Reuse" title="Data Reuse"><h3 id="Term-Explanation"><a href="#Term-Explanation" class="headerlink" title="Term Explanation"></a>Term Explanation</h3><p>在介紹 AI 加速器的種類前，先整理幾個常用術語。</p><p>一般電腦程式會先透過 Compiler 編譯成 Binary code，再交給 Processor 執行；Processor 的結構通常稱為 Architecture。</p><p>AI 加速器的流程很像：DNN 會先透過 Mapper 轉成對應的 Mapping，再交給 DNN Accelerator 
執行；DNN Accelerator 的資料流設計則稱為 Dataflow。</p><img src="https://i.imgur.com/MpEw1mH.png" alt="Energy Consumption of Component" title="Energy Consumption of Component"><h3 id="Dataflow"><a href="#Dataflow" class="headerlink" title="Dataflow"></a>Dataflow</h3><p>AI 加速器的 Dataflow 主要可以分成 4 種：</p><h4 id="Weight-Stationary（WS）"><a href="#Weight-Stationary（WS）" class="headerlink" title="Weight Stationary（WS）"></a>Weight Stationary（WS）</h4><p>WS 的目標是最小化讀取 Weights 的能耗，也就是盡量讓 Weight 留在 Register File 中重複使用。</p><img src="https://i.imgur.com/HkZkjv8.png" alt="WS Dataflow" title="WS Dataflow"><h4 id="Output-Stationary（OS）"><a href="#Output-Stationary（OS）" class="headerlink" title="Output Stationary（OS）"></a>Output Stationary（OS）</h4><p>OS 的目標是最小化讀寫 Partial Sum 的能耗，也就是盡量讓 Partial Sum 留在 Register File 中。</p><img src="https://i.imgur.com/AKHLFT0.png" alt="OS Dataflow" title="OS Dataflow"><p>OS Dataflow 又可以依照 Channel 與 Activation 的數量分成 OSA、OSB、OSC。OSA 主要處理 Convolution，OSC 主要處理 FC Layer，OSB 則介於兩者之間。</p><img src="https://i.imgur.com/mVeGFkX.png" alt="OS Dataflow Detail" title="OS Dataflow Detail"><h4 id="No-Local-Reuse（NLR）"><a href="#No-Local-Reuse（NLR）" class="headerlink" title="No Local Reuse（NLR）"></a>No Local Reuse（NLR）</h4><p>Register File 可以減少能耗，但會增加面積。NLR 則反過來弱化 local reuse，目標是最大化 Global Buffer 的儲存能力，並最小化 Off-Chip Memory Bandwidth。</p><img src="https://i.imgur.com/YZ3mI0R.png" alt="NLR Dataflow" title="NLR Dataflow"><h4 id="Row-Stationary（RS）"><a href="#Row-Stationary（RS）" class="headerlink" title="Row Stationary（RS）"></a>Row Stationary（RS）</h4><p>RS 的目標是最大化所有類型資料在 Register File 中的重複使用機會。以 1 維 Convolution 為例，PE 中的 Filter 幾乎固定不動，只位移 Input 與 Partial Sum 來完成運算。</p><img src="https://i.imgur.com/WJsCaqB.png" alt="RS Dataflow(1)" title="RS Dataflow(1)"><p>2 維陣列的概念也類似，只是會使用多個 PE。Filter 仍然維持相對固定，每次計算 Row 時主要位移 Input 與 Partial Sum。</p><img src="https://i.imgur.com/6OfyrSJ.png" alt="RS Dataflow(2)" title="RS Dataflow(2)"><p>如果有多個 Channel 或 Batch Size 大於 1，則可以透過連接與交錯的方式得到對應輸出。</p><img 
src="https://i.imgur.com/5evaqwN.png" alt="RS Dataflow(3)" title="RS Dataflow(3)"><h3 id="Dataflow-Example"><a href="#Dataflow-Example" class="headerlink" title="Dataflow Example"></a>Dataflow Example</h3><p>作者使用 Eyeriss DNN Accelerator 作為範例，它的 PE Array 大小為 <code>12 * 14</code>。</p><img src="https://i.imgur.com/v6eyINS.png" alt="Eyeriss DNN accelerator" title="Eyeriss DNN accelerator"><p>這時會遇到兩個問題：</p><ol><li>PE Array 和 Layer 大小不同：Layer 較小時，可以一次塞多張 Layer 到 PE Array；Layer 較大時，則可以透過裁切或 folding 塞進 PE Array。</li></ol><img src="https://i.imgur.com/7VhpHmZ.png" alt="Replication & Folding" title="Replication & Folding"><ol start="2"><li>資料不知道要傳到哪個 PE：可以使用 Multicast Network 解決。最簡單的做法是廣播資料，再由 PE Array 中的 Control Logic 篩選每個 PE 需要的資料。</li></ol><h3 id="Dataflow-Comparision"><a href="#Dataflow-Comparision" class="headerlink" title="Dataflow Comparision"></a>Dataflow Comparision</h3><p>接著比較 WS、OS、NLR 與 RS 的能耗表現。</p><p>在 Convolutional Layer 中，RS 因為最大化 Register File 中的資料重複使用，所以 RF 能耗較高，但整體能耗最低。NLR 因為沒有 RF，資料都放在 Buffer，因此 Buffer 能耗最高。OSA 專門處理 Convolution，所以能耗也比 OSC 更低。</p><img src="https://i.imgur.com/ksk3LMN.png" alt="Energy consumption of Convotional Layer(1)" title="Energy consumption of Convotional Layer(1)"><p>從另一個角度看，WS 因為最大化 Weight reuse，所以 Weight 能耗最低；OS 因為最大化 Partial Sum reuse，所以 Partial Sum 能耗最低。</p><img src="https://i.imgur.com/YmpSgeC.png" alt="Energy consumption of Convotional Layer(2)" title="Energy consumption of Convotional Layer(2)"><p>在 FC Layer 中，OSC 因為更適合 FC Layer，所以能耗比 OSA 小。</p><img src="https://i.imgur.com/Y9osx8H.png" alt="Energy consumption of FC Layer" title="Energy consumption of FC Layer"><p>最後一張圖是使用 RS Dataflow 跑 AlexNet 的能耗分析。L1 ~ L5 多為 Convolutional Layer，RF 能耗較高；L6 ~ L8 多為 FC Layer，DRAM 能耗較高。整體來看，L1 ~ L5 消耗了大部分能量，而後續神經網路也越來越偏向大量使用 Convolution，因此改善 Convolution 的資料搬移與重複使用會非常重要。</p><img src="https://i.imgur.com/TWA5HA8.png" alt="Energy consumption of AlexNet" title="Energy consumption of AlexNet"><h2 id="Reference"><a href="#Reference" 
class="headerlink" title="Reference"></a>Reference</h2><ul><li><a href="https://arxiv.org/pdf/1703.09039.pdf">Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Part V</a></li><li><a href="https://zhuanlan.zhihu.com/p/300603589">FFT Convolution 參考</a></li><li><a href="https://www.csie.ntu.edu.tw/~wcchen/algorithm/strassen/strassen.html">Strassen Algorithm 參考</a></li></ul>]]>
    </content>
    <id>https://www.threemonth03.com/2023/06/21/2023-06-21-Efficient-Processing-of-Deep-Neural-Networks%E5%B0%8F%E7%B5%90%E7%B2%BE%E8%AE%80/</id>
    <link href="https://www.threemonth03.com/2023/06/21/2023-06-21-Efficient-Processing-of-Deep-Neural-Networks%E5%B0%8F%E7%B5%90%E7%B2%BE%E8%AE%80/"/>
    <published>2023-06-21T02:06:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="Introduction-to-Hardware"><a href="#Introduction-to-Hardware" class="headerlink" title="Introduction to Har]]>
    </summary>
    <title>Efficient Processing of Deep Neural Networks, A Tutorial and Survey (Reading Notes)</title>
    <updated>2026-05-10T14:08:50.819Z</updated>
  </entry>
</feed>
