<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <author>
    <name>張君實</name>
  </author>
  <generator uri="https://hexo.io/">Hexo</generator>
  <id>https://www.threemonth03.com/</id>
  <link href="https://www.threemonth03.com/" rel="alternate"/>
  <link href="https://www.threemonth03.com/atom.xml" rel="self"/>
  <rights>All rights reserved 2026, 張君實</rights>
  <subtitle>技術筆記與研究紀錄</subtitle>
  <title>張君實</title>
  <updated>2026-05-10T14:11:04.358Z</updated>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="機器學習" scheme="https://www.threemonth03.com/categories/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92/"/>
    <category term="NNI" scheme="https://www.threemonth03.com/tags/NNI/"/>
    <category term="Hyperparameter Tuning" scheme="https://www.threemonth03.com/tags/Hyperparameter-Tuning/"/>
    <category term="CIFAR10" scheme="https://www.threemonth03.com/tags/CIFAR10/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理如何使用 NNI（Neural Network Intelligence）做超參數搜尋。範例任務是用 CNN 訓練 CIFAR-10，並透過 NNI 搜尋 learning rate、momentum 與 batch size。</p><h2 id="作業檔案簡介"><a href="#作業檔案簡介" class="headerlink" title="作業檔案簡介"></a>作業檔案簡介</h2><p>程式碼放在 <a href="https://github.com/ThreeMonth03/hyperparameter_tuning">ThreeMonth03&#x2F;hyperparameter_tuning</a>。</p><p>主要目錄如下：</p><ul><li><code>config/</code>：放 <code>requirement.txt</code>。</li><li><code>src/</code>：放 source code，包含 <code>cnn.py</code> 與 <code>nni_search.py</code>。</li><li><code>log/</code>：放 NNI experiment log，可以回放歷史 training 紀錄。</li></ul><h2 id="如何從頭復現-NNI-Training"><a href="#如何從頭復現-NNI-Training" class="headerlink" title="如何從頭復現 NNI Training"></a>如何從頭復現 NNI Training</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/ThreeMonth03/hyperparameter_tuning.git</span><br><span class="line"><span class="built_in">cd</span> hyperparameter_tuning</span><br><span class="line">docker-compose up</span><br></pre></td></tr></table></figure><p>接著在瀏覽器打開：</p><figure class="highlight text"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">http://localhost:[your_port]</span><br></pre></td></tr></table></figure><p>這裡建議不要使用 <code>docker-compose up -d</code>，否則 experiment log 可能不會被正常保存。實際部署時，也記得依照環境調整 port、container name 與 image name。</p><h2 id="如何直接看-Training-Log"><a href="#如何直接看-Training-Log" class="headerlink" title="如何直接看 Training Log"></a>如何直接看 Training Log</h2><p>如果只想查看既有 log，可以改用 <code>nni_search.py</code> 裡的 <code>experiment.view</code>：</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td 
class="code"><pre><span class="line">experiment.view(experiment_id, port=<span class="number">8323</span>, non_blocking=<span class="literal">False</span>)</span><br></pre></td></tr></table></figure><p>操作流程：</p><ol><li>Clone repo。</li><li>依照 <code>nni_search.py</code> 內的註解，關閉 training 設定，打開 <code>experiment.view(...)</code>。</li><li>執行 <code>docker-compose up</code>。</li><li>到 <code>localhost:[your_port]</code> 查看結果。</li></ol><h2 id="實驗設定"><a href="#實驗設定" class="headerlink" title="實驗設定"></a>實驗設定</h2><table><thead><tr><th>Hyperparameter</th><th>Search Space</th></tr></thead><tbody><tr><td><code>lr</code></td><td><code>0.0001 ~ 0.1</code>，log uniform</td></tr><tr><td><code>momentum</code></td><td><code>0 ~ 1</code>，uniform</td></tr><tr><td><code>batch_size</code></td><td><code>4</code>、<code>8</code>、<code>16</code></td></tr><tr><td>Tuner</td><td>TPE</td></tr></tbody></table><h3 id="Result"><a href="#Result" class="headerlink" title="Result"></a>Result</h3><p>Best hyperparameter：</p><ul><li><code>lr</code>: <code>0.0024724673142795927</code></li><li><code>momentum</code>: <code>0.31344560117709097</code></li><li><code>batch_size</code>: <code>8</code></li></ul><p>Test Accuracy：<code>65%</code></p><img src="https://i.imgur.com/o8f06cB.png"> <img src="https://i.imgur.com/JzeBAuD.png"> <h2 id="筆記"><a href="#筆記" class="headerlink" title="筆記"></a>筆記</h2><h3 id="如何用-Python-API-調-Hyperparameter"><a href="#如何用-Python-API-調-Hyperparameter" class="headerlink" title="如何用 Python API 調 Hyperparameter"></a>如何用 Python API 調 Hyperparameter</h3><p>NNI 可以透過 terminal 指令或 Python API 控制 hyperparameter。以下是透過 Python API 設定 search space 與 experiment 的範例。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span 
class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nni_search.py</span></span><br><span class="line">search_space = &#123;</span><br><span class="line">    <span class="string">&#x27;lr&#x27;</span>: &#123;<span class="string">&#x27;_type&#x27;</span>: <span class="string">&#x27;loguniform&#x27;</span>, <span class="string">&#x27;_value&#x27;</span>: [<span class="number">0.0001</span>, <span class="number">0.1</span>]&#125;,</span><br><span class="line">    <span class="string">&#x27;momentum&#x27;</span>: &#123;<span class="string">&#x27;_type&#x27;</span>: <span class="string">&#x27;uniform&#x27;</span>, <span class="string">&#x27;_value&#x27;</span>: [<span class="number">0</span>, <span class="number">1</span>]&#125;,</span><br><span class="line">    <span class="string">&#x27;batch_size&#x27;</span>: &#123;<span class="string">&quot;_type&quot;</span>: <span class="string">&quot;choice&quot;</span>, <span class="string">&quot;_value&quot;</span>: [<span class="number">4</span>, <span class="number">8</span>, <span class="number">16</span>]&#125;,</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> nni</span><br><span class="line"><span 
class="keyword">from</span> nni.experiment <span class="keyword">import</span> Experiment</span><br><span class="line"></span><br><span class="line">experiment = Experiment(<span class="string">&#x27;local&#x27;</span>)</span><br><span class="line">experiment.config.trial_command = <span class="string">&#x27;python src/cnn.py&#x27;</span></span><br><span class="line">experiment.config.trial_code_directory = <span class="string">&#x27;.&#x27;</span></span><br><span class="line">experiment.config.search_space = search_space</span><br><span class="line">experiment.config.tuner.name = <span class="string">&#x27;TPE&#x27;</span></span><br><span class="line">experiment.config.tuner.class_args[<span class="string">&#x27;optimize_mode&#x27;</span>] = <span class="string">&#x27;maximize&#x27;</span></span><br><span class="line">experiment.config.max_trial_number = <span class="number">50</span></span><br><span class="line">experiment.config.trial_concurrency = <span class="number">10</span></span><br><span class="line">experiment.config.trial_gpu_number = <span class="number">3</span></span><br><span class="line">experiment.config.debug = <span class="literal">True</span></span><br><span class="line">experiment.config.experiment_working_directory = <span class="string">&#x27;./log&#x27;</span></span><br><span class="line">experiment.config.training_service.use_active_gpu = <span class="literal">True</span></span><br><span class="line">experiment.config.training_service.max_trial_number_per_gpu = <span class="number">10</span></span><br><span class="line"></span><br><span class="line">experiment.run(<span class="number">8323</span>)</span><br><span class="line"><span class="built_in">print</span>(experiment.get_status())</span><br><span class="line"><span class="built_in">print</span>(experiment.get_job_statistics())</span><br><span class="line"><span class="built_in">print</span>(experiment.list_trial_jobs())</span><br><span class="line"></span><br><span class="line"><span 
class="built_in">input</span>(<span class="string">&#x27;Press enter to quit&#x27;</span>)</span><br><span class="line">experiment.stop()</span><br></pre></td></tr></table></figure><p>被控制的 model 也要加入 NNI 參數讀取與回報結果的邏輯。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># cnn.py</span></span><br><span class="line"><span class="keyword">import</span> nni</span><br><span class="line"><span class="comment">#......</span></span><br><span class="line">params = &#123;</span><br><span class="line">    <span class="string">&#x27;lr&#x27;</span>: <span class="number">0.001</span>,</span><br><span class="line">    <span class="string">&#x27;momentum&#x27;</span>: <span class="number">0</span>,</span><br><span class="line">    <span class="string">&#x27;batch_size&#x27;</span>: <span class="number">4</span>,</span><br><span class="line">&#125;</span><br><span class="line">optimized_params = nni.get_next_parameter()</span><br><span class="line">params.update(optimized_params)</span><br><span class="line"><span 
class="built_in">print</span>(params)</span><br><span class="line"><span class="comment">#......</span></span><br><span class="line">epochs = <span class="number">20</span></span><br><span class="line">batch_size = params[<span class="string">&#x27;batch_size&#x27;</span>]</span><br><span class="line">lr = params[<span class="string">&#x27;lr&#x27;</span>]</span><br><span class="line">momentum = params[<span class="string">&#x27;momentum&#x27;</span>]</span><br><span class="line"><span class="comment">#......</span></span><br><span class="line"><span class="keyword">with</span> torch.no_grad():</span><br><span class="line">    <span class="keyword">for</span> data <span class="keyword">in</span> testloader:</span><br><span class="line">        images, labels = data[<span class="number">0</span>].to(device), data[<span class="number">1</span>].to(device)</span><br><span class="line">        outputs = net(images)</span><br><span class="line">        _, predicted = torch.<span class="built_in">max</span>(outputs.data, <span class="number">1</span>)</span><br><span class="line">        total += labels.size(<span class="number">0</span>)</span><br><span class="line">        correct += (predicted == labels).<span class="built_in">sum</span>().item()</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span>(<span class="string">f&#x27;Accuracy of the network on the 10000 test images: <span class="subst">&#123;<span class="number">100</span> * correct // total&#125;</span> %&#x27;</span>)</span><br><span class="line">nni.report_final_result(<span class="number">100</span> * correct // total)</span><br></pre></td></tr></table></figure><h2 id="小結"><a href="#小結" class="headerlink" title="小結"></a>小結</h2><p>NNI 的好處是可以把「手動反覆調參」變成可重現的實驗流程。只要把 search space、tuner 與 training script 接好，就能自動化比較不同超參數組合，並保留 experiment log 供後續分析。</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ul><li><a 
href="https://nni.readthedocs.io/en/stable/">NNI Documentation</a></li><li><a href="https://github.com/microsoft/nni">microsoft&#x2F;nni</a></li></ul>]]>
    </content>
    <id>https://www.threemonth03.com/2023/08/27/2023-08-27-%E4%BD%BF%E7%94%A8nni%E6%89%BE%E6%9C%80%E4%BD%B3%E8%B6%85%E5%8F%83%E6%95%B8/</id>
    <link href="https://www.threemonth03.com/2023/08/27/2023-08-27-%E4%BD%BF%E7%94%A8nni%E6%89%BE%E6%9C%80%E4%BD%B3%E8%B6%85%E5%8F%83%E6%95%B8/"/>
    <published>2023-08-27T02:13:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理如何使用 NNI（Neural Network Intelligence）做超]]>
    </summary>
    <title>使用nni尋找最佳超參數</title>
    <updated>2026-05-10T14:11:04.358Z</updated>
  </entry>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="開發環境" scheme="https://www.threemonth03.com/categories/%E9%96%8B%E7%99%BC%E7%92%B0%E5%A2%83/"/>
    <category term="Docker" scheme="https://www.threemonth03.com/tags/Docker/"/>
    <category term="Jupyter" scheme="https://www.threemonth03.com/tags/Jupyter/"/>
    <category term="TensorBoard" scheme="https://www.threemonth03.com/tags/TensorBoard/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理如何用 Docker Compose 建立 Jupyter 與 TensorBoard 環境，並透過 local forwarding 在本機瀏覽器使用遠端服務。</p><p>範例 repo 放在 <a href="https://github.com/ThreeMonth03/Docker_example">ThreeMonth03&#x2F;Docker_example</a>。</p><h2 id="專案結構"><a href="#專案結構" class="headerlink" title="專案結構"></a>專案結構</h2><p>這個 repository 裡有幾個重點：</p><ul><li><code>jupyter/</code>：Jupyter 服務的 Dockerfile。</li><li><code>tensorboard/</code>：TensorBoard 服務的 Dockerfile。</li><li><code>docker-compose.yml</code>：管理兩個 image 與 container。</li><li><code>main.ipynb</code>、<code>logs/</code>：用來驗證 Jupyter 與 TensorBoard 是否正常。</li></ul><h2 id="執行流程"><a href="#執行流程" class="headerlink" title="執行流程"></a>執行流程</h2><p>如果服務跑在遠端機器上，可以先透過 SSH local forwarding 把 port 轉回本機。假設要轉兩個服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ssh -L localhost:8323:localhost:8323 [Account]@[Server IP]</span><br><span class="line">ssh -L localhost:8324:localhost:8324 [Account]@[Server IP]</span><br></pre></td></tr></table></figure><p>接著在遠端機器 clone repo，並啟動服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/ThreeMonth03/Docker_example.git</span><br><span class="line"><span class="built_in">cd</span> Docker_example</span><br><span class="line">docker-compose up -d</span><br></pre></td></tr></table></figure><p>啟動後，在本機瀏覽器打開：</p><figure class="highlight text"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">http://localhost:8323/</span><br><span 
class="line">http://localhost:8324/</span><br></pre></td></tr></table></figure><p>使用完後關閉服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose down</span><br></pre></td></tr></table></figure><p>如果曾經修改過 <code>docker-compose.yml</code>，導致出現 orphan container，可以改用：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose down --remove-orphans</span><br></pre></td></tr></table></figure><h2 id="Local-Forwarding"><a href="#Local-Forwarding" class="headerlink" title="Local Forwarding"></a>Local Forwarding</h2><p>有時候遠端 server 的特定 port 不會直接對外開放。這時可以透過 SSH local forwarding，讓本機 port 對應到遠端 server 的 port。</p><img src="https://johnliu55.tw/ssh-tunnel/images/local_scenario1_problem.png" alt="防火牆" title="防火牆"> <p>基本格式如下：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ssh -L localhost:[your_computer_port]:localhost:[server_port] [Account]@[Server IP]</span><br></pre></td></tr></table></figure><img src="https://johnliu55.tw/ssh-tunnel/images/local_scenario1_solved.png" alt="Local Forwarding" title="Local Forwarding"> <p>例如帳號是 <code>threemonth</code>，server IP 是 <code>123.456.78.901</code>，要把本機 <code>9090</code> 對到遠端 <code>8080</code>：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ssh -L localhost:9090:localhost:8080 threemonth@123.456.78.901</span><br></pre></td></tr></table></figure><h2 id="Jupyter-與-TensorBoard-指令"><a href="#Jupyter-與-TensorBoard-指令" class="headerlink" title="Jupyter 與 TensorBoard 指令"></a>Jupyter 與 TensorBoard 指令</h2><p>Jupyter 可以用以下指令啟動：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">jupyter notebook \</span><br><span class="line">  --no-browser \</span><br><span class="line">  --ip=0.0.0.0 \</span><br><span class="line">  --port=8080 \</span><br><span class="line">  --allow-root \</span><br><span class="line">  --NotebookApp.token=<span class="string">&#x27;&#x27;</span> \</span><br><span class="line">  --NotebookApp.password=<span class="string">&#x27;&#x27;</span></span><br></pre></td></tr></table></figure><p>幾個參數用途：</p><ul><li><code>--no-browser</code>：避免 server 嘗試打開瀏覽器。</li><li><code>--ip=0.0.0.0</code>：讓外部位址可以連到 Jupyter 服務。</li><li><code>--port</code>：指定 container 內服務 port。</li><li><code>--allow-root</code>：允許 root 身分執行 Jupyter。</li><li><code>--NotebookApp.token=&#39;&#39;</code> 與 <code>--NotebookApp.password=&#39;&#39;</code>：關閉 token 與 password 驗證，適合搭配受控環境或 tunnel 使用。</li></ul><p>TensorBoard 可以用以下指令啟動：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">tensorboard --logdir ./logs --host=0.0.0.0 --port=8081</span><br></pre></td></tr></table></figure><p>幾個參數用途：</p><ul><li><code>--logdir</code>：指定 log 路徑。</li><li><code>--host=0.0.0.0</code>：讓外部位址可以連到 TensorBoard。</li><li><code>--port</code>：指定服務 port。</li></ul><h2 id="Image-與-Container"><a href="#Image-與-Container" class="headerlink" title="Image 與 Container"></a>Image 與 Container</h2><p>Docker 裡最常遇到兩個概念：image 與 container。</p><ul><li>Image：環境模板，描述要安裝哪些套件、預設執行什麼指令。</li><li>Container：根據 image 開出來的執行實例，真正提供服務。</li></ul><p>Image 可以保留並重複使用；container 用完後通常可以刪掉，下次再用同一個 image 開新的 container。</p><h3 id="常用-Image-指令"><a href="#常用-Image-指令" class="headerlink" title="常用 Image 指令"></a>常用 Image 指令</h3><p>根據 Dockerfile 建立 image：</p><figure 
class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker build -t [image_name] [path]</span><br></pre></td></tr></table></figure><p>例如 Dockerfile 在目前資料夾，要建立 <code>jupyter_image</code>：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker build -t jupyter_image .</span><br></pre></td></tr></table></figure><p>如果想忽略 cache 重新 build：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker build -t [image_name] [path] --no-cache</span><br></pre></td></tr></table></figure><p>查看 image：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker images</span><br></pre></td></tr></table></figure><p>刪除 image：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker image <span class="built_in">rm</span> [image_name]</span><br></pre></td></tr></table></figure><h3 id="Dockerfile-範例"><a href="#Dockerfile-範例" class="headerlink" title="Dockerfile 範例"></a>Dockerfile 範例</h3><p>以下是一個 TensorBoard image 的 Dockerfile：</p><figure class="highlight dockerfile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">FROM</span> pytorch/pytorch:<span class="number">1.13</span>.<span class="number">0</span>-cuda11.<span class="number">6</span>-cudnn8-devel</span><br><span class="line"></span><br><span class="line"><span class="keyword">RUN</span><span class="language-bash"> apt-get update &amp;&amp;\</span></span><br><span class="line"><span class="language-bash">    apt-get -y upgrade &amp;&amp;\</span></span><br><span class="line"><span class="language-bash">    apt-get install -y git net-tools vim <span class="built_in">sudo</span> tcsh gcc g++ unzip python3 python3-pip &amp;&amp;\</span></span><br><span class="line"><span class="language-bash">    apt-get clean &amp;&amp;\</span></span><br><span class="line"><span class="language-bash">    <span class="built_in">rm</span> -rf /var/lib/apt/lists/*</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">RUN</span><span class="language-bash"> pip3 --no-cache-dir install torch \</span></span><br><span class="line"><span class="language-bash">    torchvision \</span></span><br><span class="line"><span class="language-bash">    torchaudio \</span></span><br><span class="line"><span class="language-bash">    tensorboard \</span></span><br><span class="line"><span class="language-bash">    jupyterlab \</span></span><br><span class="line"><span class="language-bash">    jupyter</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">CMD</span><span class="language-bash"> [<span class="string">&quot;tensorboard&quot;</span>, <span class="string">&quot;--logdir&quot;</span>, <span class="string">&quot;./logs&quot;</span>, <span class="string">&quot;--host=0.0.0.0&quot;</span>, <span class="string">&quot;--port=8324&quot;</span>]</span></span><br></pre></td></tr></table></figure><p>重點如下：</p><ul><li><code>FROM</code>：指定 base 
image。</li><li><code>RUN</code>：安裝套件或執行建置指令。</li><li><code>CMD</code>：container 啟動後預設執行的 command。</li></ul><h3 id="常用-Container-指令"><a href="#常用-Container-指令" class="headerlink" title="常用 Container 指令"></a>常用 Container 指令</h3><p>建好 image 後，就可以用 <code>docker run</code> 建立並執行 container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run [options] [image_name] [<span class="built_in">command</span>]</span><br></pre></td></tr></table></figure><p>常見 options：</p><ul><li><code>-it</code>：開啟互動式 terminal，常搭配 <code>bash</code> 使用。</li><li><code>--name</code>：指定 container 名稱，建議加上，方便管理。</li><li><code>-p</code>：做 port mapping，例如 <code>8080:8080</code>。</li><li><code>-v</code>：mount 本機資料夾到 container 內。</li><li><code>--gpus all</code>：讓 container 使用 GPU。</li><li><code>command</code>：覆蓋 image 中的預設 <code>CMD</code>。</li></ul><p>只執行 <code>jupyter_image</code>，並命名為 <code>jupyter_container</code>：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run --name jupyter_container jupyter_image</span><br></pre></td></tr></table></figure><p>如果想開 terminal、轉 port、掛載目前資料夾、並使用 GPU：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run -it --name jupyter_container -p 8080:8080 -v ./:/workspace --gpus all jupyter_image bash</span><br></pre></td></tr></table></figure><p>離開 container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">exit</span></span><br></pre></td></tr></table></figure><p>重新啟動並 attach：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td 
class="code"><pre><span class="line">docker start [container_name]</span><br><span class="line">docker attach [container_name]</span><br></pre></td></tr></table></figure><p>查看 container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">docker ps</span><br><span class="line">docker ps -a</span><br></pre></td></tr></table></figure><p>刪除 container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker <span class="built_in">rm</span> [CONTAINER ID]</span><br></pre></td></tr></table></figure><h2 id="Docker-Compose"><a href="#Docker-Compose" class="headerlink" title="Docker Compose"></a>Docker Compose</h2><p>如果一次要管理多個服務，例如 Jupyter 與 TensorBoard，就適合使用 Docker Compose。它可以用一份 <code>docker-compose.yml</code> 管理多個 image、container、port 與 volume。</p><p>啟動服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose up</span><br></pre></td></tr></table></figure><p>讓服務在背景執行：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose up -d</span><br></pre></td></tr></table></figure><p>停用服務：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose down</span><br></pre></td></tr></table></figure><p>清掉 orphan container：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose down --remove-orphans</span><br></pre></td></tr></table></figure><h3 id="docker-compose-yml-範例"><a href="#docker-compose-yml-範例" class="headerlink" 
title="docker-compose.yml 範例"></a>docker-compose.yml 範例</h3><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">version:</span> <span class="string">&quot;3&quot;</span></span><br><span class="line"><span class="attr">services:</span></span><br><span class="line">  <span class="attr">Jupyter:</span></span><br><span class="line">    <span class="attr">build:</span> <span class="string">./jupyter</span></span><br><span class="line">    <span class="attr">image:</span> <span class="string">docker/threemonth</span></span><br><span class="line">    <span class="attr">container_name:</span> <span class="string">jupyterthreemonth</span></span><br><span class="line">    <span class="attr">ports:</span> </span><br><span class="line">    <span class="bullet">-</span> <span class="string">&quot;8323:8323&quot;</span></span><br><span class="line">    <span class="attr">volumes:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">./:/workspace</span> </span><br><span class="line">    <span class="attr">restart:</span> <span class="string">unless-stopped</span></span><br><span class="line">    <span 
class="attr">command:</span> <span class="string">jupyter</span> <span class="string">notebook</span> <span class="string">--no-browser</span> <span class="string">--ip=0.0.0.0</span> <span class="string">--port=8323</span> <span class="string">--allow-root</span> <span class="string">--NotebookApp.token=&#x27;&#x27;</span> <span class="string">--NotebookApp.password=&#x27;&#x27;</span></span><br><span class="line"></span><br><span class="line">  <span class="attr">Tensorboard:</span></span><br><span class="line">    <span class="attr">build:</span> <span class="string">./tensorboard</span></span><br><span class="line">    <span class="attr">image:</span> <span class="string">docker/threemonth2</span></span><br><span class="line">    <span class="attr">container_name:</span> <span class="string">tensorboardthreemonth</span></span><br><span class="line">    <span class="attr">ports:</span> </span><br><span class="line">    <span class="bullet">-</span> <span class="string">&quot;8324:8324&quot;</span></span><br><span class="line">    <span class="attr">depends_on:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">Jupyter</span></span><br><span class="line">    <span class="attr">volumes:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">./:/workspace</span> </span><br><span class="line">    <span class="attr">restart:</span> <span class="string">unless-stopped</span></span><br></pre></td></tr></table></figure><p>幾個欄位對應到 <code>docker run</code> 的概念：</p><ul><li><code>build</code>：指定 Dockerfile 所在目錄。</li><li><code>image</code>：image 名稱。</li><li><code>container_name</code>：container 名稱。</li><li><code>ports</code>：對應 <code>docker run -p</code>。</li><li><code>volumes</code>：對應 <code>docker run -v</code>。</li><li><code>restart: unless-stopped</code>：除非手動停止，否則 container 掛掉後會自動重啟。</li><li><code>depends_on</code>：控制服務啟動順序。</li><li><code>command</code>：覆蓋 Dockerfile 中的預設 
<code>CMD</code>。</li></ul><h2 id="小結"><a href="#小結" class="headerlink" title="小結"></a>小結</h2><p>Docker Compose 很適合用來管理多個彼此相關的開發服務。這個範例把 Jupyter 與 TensorBoard 分成兩個 container，再透過 volume 共用工作目錄，最後用 local forwarding 讓本機可以安全地連到遠端服務。</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ul><li><a href="https://yeasy.gitbook.io/docker_practice/">Docker 從入門到實踐</a></li><li><a href="https://johnliu55.tw/ssh-tunnel.html">SSH Tunnel 筆記</a></li><li><a href="https://azole.medium.com/docker-container-%E5%9F%BA%E7%A4%8E%E5%85%A5%E9%96%80%E7%AF%87-2-c14d8f852ae4">Docker Container 基礎入門篇</a></li></ul>]]>
    </content>
    <id>https://www.threemonth03.com/2023/07/21/2023-07-21-%E9%80%8F%E9%81%8Edocker%E5%BB%BA%E7%AB%8Bjupyter%E8%88%87tensorboard%E7%92%B0%E5%A2%83/</id>
    <link href="https://www.threemonth03.com/2023/07/21/2023-07-21-%E9%80%8F%E9%81%8Edocker%E5%BB%BA%E7%AB%8Bjupyter%E8%88%87tensorboard%E7%92%B0%E5%A2%83/"/>
    <published>2023-07-20T19:18:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理如何用 Docker Compose 建立 Jupyter 與 TensorB]]>
    </summary>
    <title>透過docker建立jupyter與tensorboard環境</title>
    <updated>2026-05-10T14:12:45.329Z</updated>
  </entry>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="深度學習" scheme="https://www.threemonth03.com/categories/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92/"/>
    <category term="Transformer" scheme="https://www.threemonth03.com/tags/Transformer/"/>
    <category term="LLaMA" scheme="https://www.threemonth03.com/tags/LLaMA/"/>
    <category term="RoPE" scheme="https://www.threemonth03.com/tags/RoPE/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理 LLaMA 使用的 Positional Embedding，也就是 RoPE（Rotary Position Embedding）。Positional Encoding 是 Transformer 中很容易被略過、但其實很關鍵的細節：它讓模型能在沒有 RNN 時間序列結構的情況下，仍然保留 token 的順序資訊。</p><h2 id="Positional-Encoding-簡介"><a href="#Positional-Encoding-簡介" class="headerlink" title="Positional Encoding 簡介"></a>Positional Encoding 簡介</h2><p>Transformer 會一次讀入整段 input，不像 RNN 會依照時間序列逐步傳入。因此，input embedding 必須額外加入位置資訊。</p><p>舉例來說：</p><p align="center"><strong><em>I Am that I Am.</em></strong></p><p>如果用 RNN 處理，token 會按照順序被送進模型；但 Transformer 一次看到整句話，如果沒有位置資訊，就很難分辨不同位置的 <code>I</code> 或 <code>Am</code> 在關係計算中的差異。</p><p>Positional Encoding 常見做法可以分成三類：</p><ol><li>絕對位置編碼：最直覺的做法是直接對 input embedding 加上 index。不過 index 很大時，可能影響原本 embedding 的語意資訊。</li><li>相對位置編碼：直接建模 token 之間的相對距離，例如 <code>Self-Attention with Relative Position Representations</code>，可以減少部分 weight matrix 的運算。</li><li>融合式：表面上使用絕對位置編碼，但經過 attention 內積後，結果會呈現相對位置關係。常見例子包含 <code>Attention Is All You Need</code> 的三角函數位置編碼，以及 RoPE。</li></ol><p>RoPE 的特色是使用複數旋轉來編碼位置，而 LLaMA 採用的正是這個方法。</p><h2 id="LLaMA使用的Positional-Embedding"><a href="#LLaMA使用的Positional-Embedding" class="headerlink" title="LLaMA使用的Positional Embedding"></a>LLaMA使用的Positional Embedding</h2><p>LLaMA 使用的 Positional Embedding 是 RoPE。它可以視為一種融合絕對位置與相對位置資訊的方法；如果硬要分類，會比較接近「用絕對位置編碼達成相對位置效果」。</p><p>RoPE 的核心形式如下：</p><img src="https://i.imgur.com/dujsQsd.png" alt="RoPE演算法" title="RoPE演算法">  <p>兩個複數做內積時，可以理解成將其中一個複數取共軛後相乘，再取實部。複數與共軛複數相乘時，指數部分會變成相減，因此 RoPE 可以把絕對位置放進歐拉表示中，再透過 attention 內積留下相對位置資訊。</p><h2 id="RoPE-Code-分析"><a href="#RoPE-Code-分析" class="headerlink" title="RoPE Code 分析"></a>RoPE Code 分析</h2><p>以下程式碼來自 <a href="https://github.com/facebookresearch/llama/blob/main/llama/model.py#L56"><code>facebookresearch/llama</code></a>。RoPE 最重要的三個 function 是 <code>precompute_freqs_cis</code>、<code>reshape_for_broadcast</code> 與 <code>apply_rotary_emb</code>。</p><figure class="highlight 
python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#......</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">precompute_freqs_cis</span>(<span class="params">dim: <span class="built_in">int</span>, end: <span class="built_in">int</span>, theta: <span class="built_in">float</span> = <span class="number">10000.0</span></span>):</span><br><span class="line">    freqs = <span class="number">1.0</span> / (theta ** (torch.arange(<span class="number">0</span>, dim, <span class="number">2</span>)[: (dim // <span class="number">2</span>)].<span class="built_in">float</span>() / dim))</span><br><span class="line">    t = torch.arange(end, device=freqs.device)  <span class="comment"># type: ignore</span></span><br><span class="line">    freqs = torch.outer(t, freqs).<span class="built_in">float</span>()  <span class="comment"># type: ignore</span></span><br><span class="line">    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  <span class="comment"># 
complex64</span></span><br><span class="line">    <span class="keyword">return</span> freqs_cis</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">reshape_for_broadcast</span>(<span class="params">freqs_cis: torch.Tensor, x: torch.Tensor</span>):</span><br><span class="line">    ndim = x.ndim</span><br><span class="line">    <span class="keyword">assert</span> <span class="number">0</span> &lt;= <span class="number">1</span> &lt; ndim</span><br><span class="line">    <span class="keyword">assert</span> freqs_cis.shape == (x.shape[<span class="number">1</span>], x.shape[-<span class="number">1</span>])</span><br><span class="line">    shape = [d <span class="keyword">if</span> i == <span class="number">1</span> <span class="keyword">or</span> i == ndim - <span class="number">1</span> <span class="keyword">else</span> <span class="number">1</span> <span class="keyword">for</span> i, d <span class="keyword">in</span> <span class="built_in">enumerate</span>(x.shape)]</span><br><span class="line">    <span class="keyword">return</span> freqs_cis.view(*shape)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">apply_rotary_emb</span>(<span class="params"></span></span><br><span class="line"><span class="params">    xq: torch.Tensor,</span></span><br><span class="line"><span class="params">    xk: torch.Tensor,</span></span><br><span class="line"><span class="params">    freqs_cis: torch.Tensor,</span></span><br><span class="line"><span class="params"></span>) -&gt; <span class="type">Tuple</span>[torch.Tensor, torch.Tensor]:</span><br><span class="line">    xq_ = torch.view_as_complex(xq.<span class="built_in">float</span>().reshape(*xq.shape[:-<span class="number">1</span>], -<span class="number">1</span>, <span class="number">2</span>))</span><br><span class="line">    
xk_ = torch.view_as_complex(xk.<span class="built_in">float</span>().reshape(*xk.shape[:-<span class="number">1</span>], -<span class="number">1</span>, <span class="number">2</span>))</span><br><span class="line">    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)</span><br><span class="line">    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(<span class="number">3</span>)</span><br><span class="line">    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(<span class="number">3</span>)</span><br><span class="line">    <span class="keyword">return</span> xq_out.type_as(xq), xk_out.type_as(xk)</span><br><span class="line"><span class="comment">#......</span></span><br></pre></td></tr></table></figure><p><code>precompute_freqs_cis</code> 會先計算每個位置的歐拉表示，也就是 $e^{im\theta}$ 與 $e^{in\theta}$。</p><p><code>apply_rotary_emb</code> 則會計算 $q_m e^{im\theta}$ 與 $k_n e^{in\theta}$，再把結果轉成實數表示，例如 $[q_m \cos(m\theta), q_m \sin(m\theta)]$ 與 $[k_n \cos(n\theta), k_n \sin(n\theta)]$。後續 attention 內積就會直接使用這些帶有旋轉位置資訊的 Q、K。</p><p><code>reshape_for_broadcast</code> 的用途是把 <code>xq_</code>、<code>xk_</code> 與 <code>freqs_cis</code> 調整成可以 broadcast 的形狀，讓矩陣能逐元素相乘。</p><h2 id="小結"><a href="#小結" class="headerlink" title="小結"></a>小結</h2><p>RoPE 有趣的地方在於，它把位置資訊藏在複數旋轉裡，最後透過 attention 的內積自然留下相對位置關係。這種設計把訊號處理與深度學習結合得很漂亮，也讓 Positional Encoding 不只是「加上一個位置向量」這麼簡單。</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ul><li><a href="https://blog.csdn.net/weixin_44826203/article/details/129255185">CSDN RoPE 筆記</a></li><li><a href="https://cloud.tencent.com/developer/article/2196111">騰訊雲 RoPE 介紹</a></li><li><a href="https://github.com/facebookresearch/llama/blob/main/llama/model.py#L56">facebookresearch&#x2F;llama model.py</a></li><li><a href="https://zhuanlan.zhihu.com/p/398457641">RoPE 相關整理</a></li><li><a href="https://arxiv.org/abs/2104.09864v4">RoFormer: Enhanced Transformer with Rotary Position Embedding</a></li><li><a 
href="https://kexue.fm/archives/8130">Transformer 升級之路：RoPE</a></li></ul>]]>
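補充一段簡化的 NumPy 驗證（示意用，非 LLaMA 原始碼）：把二維向量視為複數、乘上 $e^{im\theta}$ 做旋轉後再取內積，可以驗證結果只取決於相對位置 $m-n$，這正是 RoPE「用絕對位置編碼達成相對位置效果」的核心性質。

```python
import numpy as np

# 簡化示意:假設 head 維度為 2,把 (x0, x1) 視為複數 x0 + i*x1
def rotate(x, pos, theta=0.1):
    # 對位置 pos 的向量做 RoPE 旋轉,等同於乘上 e^{i * pos * theta}
    z = (x[0] + 1j * x[1]) * np.exp(1j * pos * theta)
    return np.array([z.real, z.imag])

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# 絕對位置 (m, n) = (7, 3) 與 (12, 8) 的相對距離都是 4
score_a = rotate(q, 7) @ rotate(k, 3)
score_b = rotate(q, 12) @ rotate(k, 8)

# 兩個內積相等:結果只取決於相對位置 m - n
assert np.isclose(score_a, score_b)
```

這裡的 `theta` 對應 `precompute_freqs_cis` 中每個頻率分量的角速度；實際 RoPE 會對每一對維度使用不同頻率，但相對位置性質的論證完全相同。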
    </content>
    <id>https://www.threemonth03.com/2023/07/20/2023-07-20-LLaMA%E4%B8%AD%E4%BD%BF%E7%94%A8%E7%9A%84Positional%20Embedding/</id>
    <link href="https://www.threemonth03.com/2023/07/20/2023-07-20-LLaMA%E4%B8%AD%E4%BD%BF%E7%94%A8%E7%9A%84Positional%20Embedding/"/>
    <published>2023-07-20T00:25:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理 LLaMA 使用的 Positional Embedding，也就是 RoP]]>
    </summary>
    <title>LLaMA中使用的Positional Embedding</title>
    <updated>2026-05-10T14:10:22.263Z</updated>
  </entry>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="深度學習" scheme="https://www.threemonth03.com/categories/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92/"/>
    <category term="PyTorch" scheme="https://www.threemonth03.com/tags/PyTorch/"/>
    <category term="Transformer" scheme="https://www.threemonth03.com/tags/Transformer/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理 <code>torch.nn.Transformer</code> 中一個常見限制：<code>d_model</code> 必須能被 <code>nhead</code> 整除。</p><p>從理論上看，Multi-Head Attention 可以想成對同一份 input 做多組 Self-Attention，再把多個 head 的輸出接起來。因此直覺上會覺得 feature size 與 head 數量不一定要整除。但 PyTorch 的實作為了效率，會把 embedding dimension 平均切給每個 head，這就是限制的來源。</p><h2 id="Multi-Head-Transformer-理論"><a href="#Multi-Head-Transformer-理論" class="headerlink" title="Multi-Head Transformer 理論"></a>Multi-Head Transformer 理論</h2><p>Multi-Head Transformer 的概念是：對同一個 input 做多個 Self-Attention，將多次輸出 concat 後，再透過一個矩陣投影回原本大小。具體流程如下：</p><img src="https://pic1.zhimg.com/80/v2-6bdaf739fd6b827b2087b4e151c560f4_720w.webp" alt="Multi-Head Transformer的輸出" title="Multi-Head Transformer的輸出"><img src="https://pic4.zhimg.com/v2-35d78d9aa9150ae4babd0ea6aa68d113_r.jpg" alt="將多個輸出壓回一個輸出的過程" title="將多個輸出壓回一個輸出的過程"><p>在這個抽象描述裡，輸入 <code>x</code> 的 feature size 和 head 數量看起來沒有硬性關係。也就是說，如果只看理論流程，不論 feature size 與 head 數量是多少，似乎都可以訓練。</p><h2 id="為什麼-d-model-需要被-nhead-整除"><a href="#為什麼-d-model-需要被-nhead-整除" class="headerlink" title="為什麼 d_model 需要被 nhead 整除"></a>為什麼 <code>d_model</code> 需要被 <code>nhead</code> 整除</h2><p>如果對 <code>nn.Transformer</code> 填入任意的 feature size 與 head 數量，可能會遇到錯誤訊息，提示 <code>embed_dim</code> 必須能被 <code>num_heads</code> 整除。</p><img src="https://i.imgur.com/H2UvQ1E.png" alt="nn.transformer的報錯範例" title="nn.transformer的報錯範例"><p>原因可以從 PyTorch source code 看出來。首先看 <a href="https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html#Transformer"><code>nn.Transformer</code></a>，其中與 <code>nhead</code>、<code>d_model</code> 相關的部分會進到 <code>TransformerEncoderLayer</code>。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Transformer</span>(<span class="title class_ inherited__">Module</span>):</span><br><span class="line">    <span class="comment">#......</span></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, d_model: <span class="built_in">int</span> = <span class="number">512</span>, nhead: <span class="built_in">int</span> = <span class="number">8</span>, num_encoder_layers: <span class="built_in">int</span> = <span class="number">6</span>,</span></span><br><span class="line"><span class="params">                 num_decoder_layers: <span class="built_in">int</span> = <span class="number">6</span>, dim_feedforward: <span class="built_in">int</span> = <span class="number">2048</span>, dropout: <span class="built_in">float</span> = <span class="number">0.1</span>,</span></span><br><span class="line"><span class="params">                 activation: <span class="type">Union</span>[<span class="built_in">str</span>, <span class="type">Callable</span>[[Tensor], Tensor]] = F.relu,</span></span><br><span class="line"><span class="params">                 custom_encoder: <span class="type">Optional</span>[<span class="type">Any</span>] = <span class="literal">None</span>, custom_decoder: <span class="type">Optional</span>[<span class="type">Any</span>] = <span class="literal">None</span>,</span></span><br><span 
class="line"><span class="params">                 layer_norm_eps: <span class="built_in">float</span> = <span class="number">1e-5</span>, batch_first: <span class="built_in">bool</span> = <span class="literal">False</span>, norm_first: <span class="built_in">bool</span> = <span class="literal">False</span>,</span></span><br><span class="line"><span class="params">                 device=<span class="literal">None</span>, dtype=<span class="literal">None</span></span>) -&gt; <span class="literal">None</span>:</span><br><span class="line">        factory_kwargs = &#123;<span class="string">&#x27;device&#x27;</span>: device, <span class="string">&#x27;dtype&#x27;</span>: dtype&#125;</span><br><span class="line">        <span class="built_in">super</span>().__init__()</span><br><span class="line">        torch._C._log_api_usage_once(<span class="string">f&quot;torch.nn.modules.<span class="subst">&#123;self.__class__.__name__&#125;</span>&quot;</span>)</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> custom_encoder <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span>:</span><br><span class="line">            <span class="variable language_">self</span>.encoder = custom_encoder</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout,</span><br><span class="line">                                                    activation, layer_norm_eps, batch_first, norm_first,</span><br><span class="line">                                                    **factory_kwargs)</span><br><span class="line">            encoder_norm = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)</span><br><span class="line">            <span class="variable language_">self</span>.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, 
encoder_norm)</span><br><span class="line">        <span class="comment">### ......</span></span><br></pre></td></tr></table></figure><p>接著看 <code>TransformerEncoderLayer</code>，可以發現真正處理 attention 的類別是 <code>MultiheadAttention</code>。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">TransformerEncoderLayer</span>(<span class="title class_ inherited__">Module</span>):</span><br><span class="line">    <span class="comment">#......</span></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, d_model: <span class="built_in">int</span>, nhead: <span class="built_in">int</span>, dim_feedforward: <span class="built_in">int</span> = <span class="number">2048</span>, dropout: <span class="built_in">float</span> = <span class="number">0.1</span>,</span></span><br><span class="line"><span class="params">                 activation: <span class="type">Union</span>[<span class="built_in">str</span>, <span class="type">Callable</span>[[Tensor], Tensor]] = F.relu,</span></span><br><span class="line"><span class="params">                 layer_norm_eps: <span class="built_in">float</span> = <span class="number">1e-5</span>, batch_first: <span class="built_in">bool</span> = <span class="literal">False</span>, norm_first: <span class="built_in">bool</span> = <span class="literal">False</span>,</span></span><br><span class="line"><span class="params">                 device=<span 
class="literal">None</span>, dtype=<span class="literal">None</span></span>) -&gt; <span class="literal">None</span>:</span><br><span class="line">        factory_kwargs = &#123;<span class="string">&#x27;device&#x27;</span>: device, <span class="string">&#x27;dtype&#x27;</span>: dtype&#125;</span><br><span class="line">        <span class="built_in">super</span>().__init__()</span><br><span class="line">        <span class="variable language_">self</span>.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,</span><br><span class="line">                                            **factory_kwargs)</span><br><span class="line">        <span class="comment">#......</span></span><br><span class="line">    <span class="comment">#......</span></span><br></pre></td></tr></table></figure><p>繼續追 <a href="https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention"><code>MultiheadAttention</code></a>，會看到它把 forward 的核心邏輯交給 <code>F.multi_head_attention_forward</code>。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span 
class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MultiheadAttention</span>(<span class="title class_ inherited__">Module</span>):</span><br><span class="line">    <span class="comment">#.......</span></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">forward</span>(<span class="params"></span></span><br><span class="line"><span class="params">            self,</span></span><br><span class="line"><span class="params">            query: Tensor,</span></span><br><span class="line"><span class="params">            key: Tensor,</span></span><br><span class="line"><span class="params">            value: Tensor,</span></span><br><span class="line"><span class="params">            key_padding_mask: <span class="type">Optional</span>[Tensor] = <span class="literal">None</span>,</span></span><br><span class="line"><span class="params">            need_weights: <span class="built_in">bool</span> = <span class="literal">True</span>,</span></span><br><span class="line"><span class="params">            attn_mask: <span class="type">Optional</span>[Tensor] = <span class="literal">None</span>,</span></span><br><span class="line"><span class="params">            average_attn_weights: <span class="built_in">bool</span> = <span class="literal">True</span>,</span></span><br><span class="line"><span class="params">            is_causal : <span 
class="built_in">bool</span> = <span class="literal">False</span></span>) -&gt; <span class="type">Tuple</span>[Tensor, <span class="type">Optional</span>[Tensor]]:</span><br><span class="line">        <span class="comment">#.......</span></span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> <span class="variable language_">self</span>._qkv_same_embed_dim:</span><br><span class="line">            attn_output, attn_output_weights = F.multi_head_attention_forward(</span><br><span class="line">                query, key, value, <span class="variable language_">self</span>.embed_dim, <span class="variable language_">self</span>.num_heads,</span><br><span class="line">                <span class="variable language_">self</span>.in_proj_weight, <span class="variable language_">self</span>.in_proj_bias,</span><br><span class="line">                <span class="variable language_">self</span>.bias_k, <span class="variable language_">self</span>.bias_v, <span class="variable language_">self</span>.add_zero_attn,</span><br><span class="line">                <span class="variable language_">self</span>.dropout, <span class="variable language_">self</span>.out_proj.weight, <span class="variable language_">self</span>.out_proj.bias,</span><br><span class="line">                training=<span class="variable language_">self</span>.training,</span><br><span class="line">                key_padding_mask=key_padding_mask, need_weights=need_weights,</span><br><span class="line">                attn_mask=attn_mask,</span><br><span class="line">                use_separate_proj_weight=<span class="literal">True</span>,</span><br><span class="line">                q_proj_weight=<span class="variable language_">self</span>.q_proj_weight, k_proj_weight=<span class="variable language_">self</span>.k_proj_weight,</span><br><span class="line">                v_proj_weight=<span class="variable 
language_">self</span>.v_proj_weight,</span><br><span class="line">                average_attn_weights=average_attn_weights,</span><br><span class="line">                is_causal=is_causal)</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            attn_output, attn_output_weights = F.multi_head_attention_forward(</span><br><span class="line">                query, key, value, <span class="variable language_">self</span>.embed_dim, <span class="variable language_">self</span>.num_heads,</span><br><span class="line">                <span class="variable language_">self</span>.in_proj_weight, <span class="variable language_">self</span>.in_proj_bias,</span><br><span class="line">                <span class="variable language_">self</span>.bias_k, <span class="variable language_">self</span>.bias_v, <span class="variable language_">self</span>.add_zero_attn,</span><br><span class="line">                <span class="variable language_">self</span>.dropout, <span class="variable language_">self</span>.out_proj.weight, <span class="variable language_">self</span>.out_proj.bias,</span><br><span class="line">                training=<span class="variable language_">self</span>.training,</span><br><span class="line">                key_padding_mask=key_padding_mask,</span><br><span class="line">                need_weights=need_weights,</span><br><span class="line">                attn_mask=attn_mask,</span><br><span class="line">                average_attn_weights=average_attn_weights,</span><br><span class="line">                is_causal=is_causal)</span><br><span class="line">        <span class="keyword">if</span> <span class="variable language_">self</span>.batch_first <span class="keyword">and</span> is_batched:</span><br><span class="line">            <span class="keyword">return</span> attn_output.transpose(<span class="number">1</span>, <span class="number">0</span>), attn_output_weights</span><br><span 
class="line">        <span class="keyword">else</span>:</span><br><span class="line">            <span class="keyword">return</span> attn_output, attn_output_weights</span><br><span class="line">    <span class="comment">#......</span></span><br></pre></td></tr></table></figure><p>最後看 <a href="https://github.com/pytorch/pytorch/blob/main/torch/nn/functional.py"><code>F.multi_head_attention_forward</code></a>。關鍵在這段：PyTorch 會先用 <code>embed_dim // num_heads</code> 算出每個 head 分到的維度，並 assert 這個拆分必須剛好整除。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">multi_head_attention_forward</span>(<span class="params"></span></span><br><span class="line"><span class="params">    query: Tensor,</span></span><br><span class="line"><span class="params">    key: Tensor,</span></span><br><span class="line"><span class="params">    value: Tensor,</span></span><br><span class="line"><span class="params">    embed_dim_to_check: <span class="built_in">int</span>,</span></span><br><span class="line"><span class="params">    num_heads: <span 
class="built_in">int</span>,</span></span><br><span class="line"><span class="params">    <span class="comment">#......</span></span></span><br><span class="line"><span class="params">    </span>)</span><br><span class="line">    <span class="comment">#......</span></span><br><span class="line">    <span class="keyword">else</span>:</span><br><span class="line">        head_dim = embed_dim // num_heads</span><br><span class="line">    <span class="keyword">assert</span> head_dim * num_heads == embed_dim, <span class="string">f&quot;embed_dim <span class="subst">&#123;embed_dim&#125;</span> not divisible by num_heads <span class="subst">&#123;num_heads&#125;</span>&quot;</span></span><br><span class="line">    <span class="comment">#......</span></span><br><span class="line">        q = q.view(bsz, num_heads, tgt_len, head_dim)</span><br><span class="line">        k = k.view(bsz, num_heads, src_len, head_dim)</span><br><span class="line">        v = v.view(bsz, num_heads, src_len, head_dim)</span><br><span class="line"></span><br><span class="line">        attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)</span><br><span class="line">        attn_output = attn_output.permute(<span class="number">2</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">3</span>).contiguous().view(bsz * tgt_len, embed_dim)</span><br><span class="line"></span><br><span class="line">        attn_output = linear(attn_output, out_proj_weight, out_proj_bias)</span><br><span class="line">        attn_output = attn_output.view(tgt_len, bsz, attn_output.size(<span class="number">1</span>))</span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> is_batched:</span><br><span class="line">            <span class="comment"># squeeze the output if input was unbatched</span></span><br><span class="line">            attn_output = attn_output.squeeze(<span 
class="number">1</span>)</span><br><span class="line">        <span class="keyword">return</span> attn_output, <span class="literal">None</span></span><br></pre></td></tr></table></figure><p>也就是說，PyTorch 的實作會把 Q、K、V reshape 成 4 維：</p><ul><li><code>batch size</code></li><li><code>number of head</code></li><li><code>target/source length</code></li><li><code>head dimension</code></li></ul><p>原本的 feature size 會被拆成 <code>number of head * head dimension</code>。如果 <code>embed_dim</code> 不能被 <code>num_heads</code> 整除，就無法平均 reshape，因此會直接 assert。</p><p>簡單來說，<code>torch.nn.Transformer</code> 的 multi-head 實作不是把完整 input 重複餵給每個 head，而是將 feature size 平均拆成多份，每份交給不同 head 計算，最後再接回來。這種做法可以大幅節省運算量與記憶體使用，但代價就是 <code>d_model</code> 必須被 <code>nhead</code> 整除。</p><h2 id="小結"><a href="#小結" class="headerlink" title="小結"></a>小結</h2><p>這個例子展示了理論描述與工程實作之間的差異。理論上 Multi-Head Attention 可以用比較抽象的方式理解，但在框架實作中，為了讓張量 reshape、batch 運算與 GPU 加速更有效率，會加入更明確的維度限制。</p><p>這也是讀 framework source code 很有價值的地方：除了理解模型，也能看到實作用哪些假設換取效能與穩定性。</p>]]>
    </content>
    <id>https://www.threemonth03.com/2023/07/14/2023-07-14-torch.nn.transformer%E7%9A%84%E7%B4%B0%E7%AF%80/</id>
    <link href="https://www.threemonth03.com/2023/07/14/2023-07-14-torch.nn.transformer%E7%9A%84%E7%B4%B0%E7%AF%80/"/>
    <published>2023-07-14T09:29:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>這篇筆記整理 <code>torch.nn.Transformer</code> 中一個常]]>
    </summary>
    <title>為什麼torch.nn.transformer中每個input的feature size需要是head數量的倍數</title>
    <updated>2026-05-10T14:09:36.003Z</updated>
  </entry>
  <entry>
    <author>
      <name>張君實</name>
    </author>
    <category term="論文筆記" scheme="https://www.threemonth03.com/categories/%E8%AB%96%E6%96%87%E7%AD%86%E8%A8%98/"/>
    <category term="DNN" scheme="https://www.threemonth03.com/tags/DNN/"/>
    <category term="Computer Architecture" scheme="https://www.threemonth03.com/tags/Computer-Architecture/"/>
    <content>
      <![CDATA[<span id="more"></span><h2 id="Introduction-to-Hardware"><a href="#Introduction-to-Hardware" class="headerlink" title="Introduction to Hardware"></a>Introduction to Hardware</h2><p>DNN 常見的運算包含 Convolution 與 FC Layer，這兩類 Layer 的核心多半是大量點積運算，而點積又會仰賴 MAC（Multiply-Accumulate，乘法累加器）。在硬體裡，MAC 通常會放在 ALU 中，因此一塊運算單元會配置不少 ALU 來支撐高吞吐量。</p><p>這篇 survey 將 DNN 運算硬體分成兩大類：</p><ul><li>Temporal Architecture：代表是 CPU、GPU。</li><li>Spatial Architecture：代表是 ASIC、FPGA 這類 DNN accelerator。</li></ul><img src="https://i.imgur.com/4xaAk1x.png" alt="Temporal Architecture and Spatial Architecture" title="Temporal Architecture and Spatial Architecture"><h3 id="Temporal-Architecture-and-Spatial-Architecture"><a href="#Temporal-Architecture-and-Spatial-Architecture" class="headerlink" title="Temporal Architecture and Spatial Architecture"></a>Temporal Architecture and Spatial Architecture</h3><p>Temporal Architecture 主要包含 CPU 與 GPU。這類架構通常透過 SIMD、SIMT 與演算法最佳化來縮短運算時間。</p><p>Spatial Architecture 則比較接近加速器的設計思路，常見實作是 ASIC 或 FPGA。它會將 Register File 與 Control Logic 放在 ALU 附近，藉此減少資料搬移，降低能耗。</p><h2 id="Temporal-Architecture"><a href="#Temporal-Architecture" class="headerlink" title="Temporal Architecture"></a>Temporal Architecture</h2><p>Temporal Architecture 的重點比較偏向「如何把 DNN 運算轉成適合 CPU&#x2F;GPU 執行的形式」。</p><h3 id="FC-Layer-and-Convolutional-Layer"><a href="#FC-Layer-and-Convolutional-Layer" class="headerlink" title="FC Layer and Convolutional Layer"></a>FC Layer and Convolutional Layer</h3><p>FC Layer 本質上就是 2 維 Input 與 Filter 的內積。如果 Input 原本不是 2 維，例如包含 Channel 或 Batch Size，就會先 flatten 成 2 維矩陣，其中一維通常是 Batch Size。轉成矩陣乘法後，CPU&#x2F;GPU 就可以用 SIMD 或 SIMT 來加速。</p><img src="https://i.imgur.com/MQJIFM5.png" alt="FC Layer Operation" title="FC Layer Operation"><p>Convolutional Layer 也可以用類似方式處理：把 convolution kernel 攤成 1 維陣列，再把 Input 攤成對應的 2 維陣列，最後轉成矩陣內積。缺點是轉換後的 Input feature map 會變大，尤其在多個 Filter 同時計算時，Input feature map 會被重複展開。</p><img src="https://i.imgur.com/qv6b0rV.png" alt="Convolutional Layer Operation" 
title="Convolutional Layer Operation"><h3 id="Convolution-Optimization"><a href="#Convolution-Optimization" class="headerlink" title="Convolution Optimization"></a>Convolution Optimization</h3><p>作者提到兩種常見的矩陣內積最佳化方法：</p><ul><li>FFT：適合把 convolution 轉到頻域計算。</li><li>Strassen Algorithm：適合加速矩陣乘法。</li></ul><p>FFT 的複雜度可從 <code>O(No^2 * Nf^2)</code> 降到 <code>O(No^2 * log No)</code>，其中 <code>Nf</code> 是 Filter 的長寬，<code>No</code> 是 Output 的長寬。不過 <code>Nf^2</code> 不一定比 <code>log No</code> 大，因此作者建議 Filter 較大時，例如 <code>filter &gt; 5 * 5</code>，再考慮 FFT。</p><p>Strassen Algorithm 則是把矩陣乘法從 <code>O(N^3)</code> 降到 <code>O(N^(log7/log2))</code>。由於 convolution 轉矩陣乘法時，Filter 越大，展開後的 Input 越容易膨脹，所以 Strassen 比較適合 Filter 較小的情境，例如 <code>filter &lt; 3 * 3</code>。</p><h2 id="Spatial-Architecture"><a href="#Spatial-Architecture" class="headerlink" title="Spatial Architecture"></a>Spatial Architecture</h2><p>MAC 運作時會從 DRAM 讀取資料、進行乘法與加法，最後再把結果寫回 DRAM。從能耗比較可以看到，DRAM 存取的成本是 ALU 運算的上百倍。</p><p>因此 Spatial Architecture 的目標是減少資料搬移。它會把 Register File 與 Control Logic 放在 ALU 旁邊，組成 Processing Engine（PE），讓資料能在更靠近運算單元的位置被重複使用。</p><img src="https://i.imgur.com/1MPRJW0.png" alt="Energy Consumption of Component" title="Energy Consumption of Component"><h3 id="Data-Reuse"><a href="#Data-Reuse" class="headerlink" title="Data Reuse"></a>Data Reuse</h3><p>PE 中的 Register File 離 ALU 很近，因此設計重點會變成如何提高 Data Reuse。作者將 Data Reuse 分成三種：</p><ul><li>Convolutional Reuse：Convolution 運算中，Filter 與 Input 都會被重複使用。</li><li>Fmap Reuse：一筆 Feature Map 資料可能被多個 Filter 使用。</li><li>Filter Reuse：多筆資料可能共用同一個 Filter，常見於 <code>Batch Size &gt; 1</code>。</li></ul><img src="https://i.imgur.com/1Yww7TW.png" alt="Data Reuse" title="Data Reuse"><h3 id="Term-Explanation"><a href="#Term-Explanation" class="headerlink" title="Term Explanation"></a>Term Explanation</h3><p>在介紹 AI 加速器的種類前，先整理幾個常用術語。</p><p>一般電腦程式會先透過 Compiler 編譯成 Binary code，再交給 Processor 執行；Processor 的結構通常稱為 Architecture。</p><p>AI 加速器的流程很像：DNN 會先透過 Mapper 轉成對應的 Mapping，再交給 DNN Accelerator 
執行；DNN Accelerator 的資料流設計則稱為 Dataflow。</p><img src="https://i.imgur.com/MpEw1mH.png" alt="Energy Consumption of Component" title="Energy Consumption of Component"><h3 id="Dataflow"><a href="#Dataflow" class="headerlink" title="Dataflow"></a>Dataflow</h3><p>AI 加速器的 Dataflow 主要可以分成 4 種：</p><h4 id="Weight-Stationary（WS）"><a href="#Weight-Stationary（WS）" class="headerlink" title="Weight Stationary（WS）"></a>Weight Stationary（WS）</h4><p>WS 的目標是最小化讀取 Weights 的能耗，也就是盡量讓 Weight 留在 Register File 中重複使用。</p><img src="https://i.imgur.com/HkZkjv8.png" alt="WS Dataflow" title="WS Dataflow"><h4 id="Output-Stationary（OS）"><a href="#Output-Stationary（OS）" class="headerlink" title="Output Stationary（OS）"></a>Output Stationary（OS）</h4><p>OS 的目標是最小化讀寫 Partial Sum 的能耗，也就是盡量讓 Partial Sum 留在 Register File 中。</p><img src="https://i.imgur.com/AKHLFT0.png" alt="OS Dataflow" title="OS Dataflow"><p>OS Dataflow 又可以依照 Channel 與 Activation 的數量分成 OSA、OSB、OSC。OSA 主要處理 Convolution，OSC 主要處理 FC Layer，OSB 則介於兩者之間。</p><img src="https://i.imgur.com/mVeGFkX.png" alt="OS Dataflow Detail" title="OS Dataflow Detail"><h4 id="No-Local-Reuse（NLR）"><a href="#No-Local-Reuse（NLR）" class="headerlink" title="No Local Reuse（NLR）"></a>No Local Reuse（NLR）</h4><p>Register File 可以減少能耗，但會增加面積。NLR 則反過來弱化 local reuse，目標是最大化 Global Buffer 的儲存能力，並最小化 Off-Chip Memory Bandwidth。</p><img src="https://i.imgur.com/YZ3mI0R.png" alt="NLR Dataflow" title="NLR Dataflow"><h4 id="Row-Stationary（RS）"><a href="#Row-Stationary（RS）" class="headerlink" title="Row Stationary（RS）"></a>Row Stationary（RS）</h4><p>RS 的目標是最大化所有類型資料在 Register File 中的重複使用機會。以 1 維 Convolution 為例，PE 中的 Filter 幾乎固定不動，只位移 Input 與 Partial Sum 來完成運算。</p><img src="https://i.imgur.com/WJsCaqB.png" alt="RS Dataflow(1)" title="RS Dataflow(1)"><p>2 維陣列的概念也類似，只是會使用多個 PE。Filter 仍然維持相對固定，每次計算 Row 時主要位移 Input 與 Partial Sum。</p><img src="https://i.imgur.com/6OfyrSJ.png" alt="RS Dataflow(2)" title="RS Dataflow(2)"><p>如果有多個 Channel 或 Batch Size 大於 1，則可以透過連接與交錯的方式得到對應輸出。</p><img 
src="https://i.imgur.com/5evaqwN.png" alt="RS Dataflow(3)" title="RS Dataflow(3)"><h3 id="Dataflow-Example"><a href="#Dataflow-Example" class="headerlink" title="Dataflow Example"></a>Dataflow Example</h3><p>作者使用 Eyeriss DNN Accelerator 作為範例，它的 PE Array 大小為 <code>12 * 14</code>。</p><img src="https://i.imgur.com/v6eyINS.png" alt="Eyeriss DNN accelerator" title="Eyeriss DNN accelerator"><p>這時會遇到兩個問題：</p><ol><li>PE Array 和 Layer 大小不同：Layer 較小時，可以一次塞多張 Layer 到 PE Array；Layer 較大時，則可以透過裁切或 folding 塞進 PE Array。</li></ol><img src="https://i.imgur.com/7VhpHmZ.png" alt="Replication & Folding" title="Replication & Folding"><ol start="2"><li>資料不知道要傳到哪個 PE：可以使用 Multicast Network 解決。最簡單的做法是廣播資料，再由 PE Array 中的 Control Logic 篩選每個 PE 需要的資料。</li></ol><h3 id="Dataflow-Comparision"><a href="#Dataflow-Comparision" class="headerlink" title="Dataflow Comparision"></a>Dataflow Comparision</h3><p>接著比較 WS、OS、NLR 與 RS 的能耗表現。</p><p>在 Convolutional Layer 中，RS 因為最大化 Register File 中的資料重複使用，所以 RF 能耗較高，但整體能耗最低。NLR 因為沒有 RF，資料都放在 Buffer，因此 Buffer 能耗最高。OSA 專門處理 Convolution，所以能耗也比 OSC 更低。</p><img src="https://i.imgur.com/ksk3LMN.png" alt="Energy consumption of Convotional Layer(1)" title="Energy consumption of Convotional Layer(1)"><p>從另一個角度看，WS 因為最大化 Weight reuse，所以 Weight 能耗最低；OS 因為最大化 Partial Sum reuse，所以 Partial Sum 能耗最低。</p><img src="https://i.imgur.com/YmpSgeC.png" alt="Energy consumption of Convotional Layer(2)" title="Energy consumption of Convotional Layer(2)"><p>在 FC Layer 中，OSC 因為更適合 FC Layer，所以能耗比 OSA 小。</p><img src="https://i.imgur.com/Y9osx8H.png" alt="Energy consumption of FC Layer" title="Energy consumption of FC Layer"><p>最後一張圖是使用 RS Dataflow 跑 AlexNet 的能耗分析。L1 ~ L5 多為 Convolutional Layer，RF 能耗較高；L6 ~ L8 多為 FC Layer，DRAM 能耗較高。整體來看，L1 ~ L5 消耗了大部分能量，而後續神經網路也越來越偏向大量使用 Convolution，因此改善 Convolution 的資料搬移與重複使用會非常重要。</p><img src="https://i.imgur.com/TWA5HA8.png" alt="Energy consumption of AlexNet" title="Energy consumption of AlexNet"><h2 id="Reference"><a href="#Reference" 
class="headerlink" title="Reference"></a>Reference</h2><ul><li><a href="https://arxiv.org/pdf/1703.09039.pdf">Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Part V</a></li><li><a href="https://zhuanlan.zhihu.com/p/300603589">FFT Convolution 參考</a></li><li><a href="https://www.csie.ntu.edu.tw/~wcchen/algorithm/strassen/strassen.html">Strassen Algorithm 參考</a></li></ul>]]>
    </content>
    <id>https://www.threemonth03.com/2023/06/21/2023-06-21-Efficient-Processing-of-Deep-Neural-Networks%E5%B0%8F%E7%B5%90%E7%B2%BE%E8%AE%80/</id>
    <link href="https://www.threemonth03.com/2023/06/21/2023-06-21-Efficient-Processing-of-Deep-Neural-Networks%E5%B0%8F%E7%B5%90%E7%B2%BE%E8%AE%80/"/>
    <published>2023-06-21T02:06:00.000Z</published>
    <summary>
      <![CDATA[<span id="more"></span>

<h2 id="Introduction-to-Hardware"><a href="#Introduction-to-Hardware" class="headerlink" title="Introduction to Har]]>
    </summary>
    <title>Efficient Processing of Deep Neural Networks, A Tutorial and Survey (Reading Notes)</title>
    <updated>2026-05-10T14:08:50.819Z</updated>
  </entry>
</feed>
