[Paper] Attention Is All You Need (2017)

Just Fighting

[Paper] Attention Is All You Need (2017) - 3

yennle 2024. 12. 4. 19:44

3.3. Position-wise Feed-Forward Networks

Each layer of the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.

The FFN consists of two linear transformations with a ReLU activation in between.

$FFN(x) = \max(0, x W_{1} + b_{1}) W_{2} + b_{2}$

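The linear transformations are the same across different positions, but they use different parameters from layer to layer.

This can also be described as two convolutions with kernel size 1.

Linear transformation and convolution with kernel size 1

The 2X2 matrix in the middle of the picture is the kernel.

If the kernel size is 1, the output (yellow matrix) has the same 3X3 size as the input (blue matrix).

Here, if the kernel is 1X1, each output value is just the input at that position multiplied by the weight, which is a linear transformation.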
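To make the kernel-size-1 claim concrete, here is a minimal sketch in plain Python. The helpers `conv1d` and `linear` and the toy sizes are my own assumptions for illustration, not code from the paper. It uses a 1D convolution over sequence positions, where a kernel of size 1 is a single `(d_in, d_out)` weight matrix.

```python
import math
import random

def conv1d(seq, kernels, bias):
    """Valid 1D convolution (stride 1, no padding).

    seq:     list of seq_len input vectors, each of size d_in
    kernels: list of k weight matrices of shape (d_in, d_out), one per offset
    bias:    list of d_out values
    """
    k = len(kernels)
    out = []
    for t in range(len(seq) - k + 1):        # slide the window over positions
        y = list(bias)
        for offset in range(k):              # positions covered by the window
            x = seq[t + offset]
            for i, x_i in enumerate(x):
                for j in range(len(bias)):
                    y[j] += x_i * kernels[offset][i][j]
        out.append(y)
    return out

def linear(x, W, b):
    """Linear transformation x W + b for one position vector x."""
    return [sum(x_i * W[i][j] for i, x_i in enumerate(x)) + b[j]
            for j in range(len(b))]

random.seed(0)
seq_len, d_in, d_out = 3, 4, 2               # toy sizes, for illustration only
seq = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(seq_len)]
W = [[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]
b = [random.uniform(-1, 1) for _ in range(d_out)]

conv_out = conv1d(seq, [W], b)               # kernel size 1: a single matrix
linear_out = [linear(x, W, b) for x in seq]  # same W and b at every position

same = all(math.isclose(c, l, abs_tol=1e-12)
           for c_row, l_row in zip(conv_out, linear_out)
           for c, l in zip(c_row, l_row))
print(same)                                        # True
print(len(conv_out), len(conv1d(seq, [W, W], b)))  # 3 2
```

With kernel size 1 the output keeps all 3 positions and matches the per-position linear transformation. With kernel size 2 the window spans two positions and the output shrinks to 2.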

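The input and output dimension is $d_{model} = 512$, and the inner layer has dimension $d_{ff}=2048$.

If the length of the input sentence is $seq\_len$:

The matrix output by multi-head attention has size $(seq\_len, d_{model})$.

The first weight matrix has size $(d_{model}, d_{ff})$.

So the result of the first linear transformation has size $(seq\_len, d_{ff})$.

After the ReLU activation, the input to the second linear transformation has size $(seq\_len, d_{ff})$.

The second weight matrix has size $(d_{ff}, d_{model})$.

Therefore, the result of the second linear transformation has size $(seq\_len, d_{model})$.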
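The shape walk-through above can be checked with a small sketch. This is plain Python with nested lists, using toy sizes (`d_model = 8`, `d_ff = 32`) so it runs quickly; the paper's values are 512 and 2048. The helper names `matmul`, `shape`, `rand_matrix` and `ffn` are hypothetical, and the random weights stand in for learned parameters.

```python
import random

def matmul(A, B):
    """Multiply an (n, m) matrix by an (m, p) matrix, both as nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def shape(M):
    """Return (rows, columns) of a nested-list matrix."""
    return (len(M), len(M[0]))

def rand_matrix(rows, cols):
    """Random stand-in for a learned weight matrix."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (position) of x."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, W1)]
    out = [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, W2)]
    return hidden, out

random.seed(0)
seq_len, d_model, d_ff = 5, 8, 32      # toy sizes; the paper uses 512 and 2048

x = rand_matrix(seq_len, d_model)      # stand-in for the attention output
W1, b1 = rand_matrix(d_model, d_ff), [random.gauss(0, 1) for _ in range(d_ff)]
W2, b2 = rand_matrix(d_ff, d_model), [random.gauss(0, 1) for _ in range(d_model)]

hidden, out = ffn(x, W1, b1, W2, b2)
print(shape(x), shape(W1), shape(hidden), shape(W2), shape(out))
# (5, 8) (8, 32) (5, 32) (32, 8) (5, 8)
```

Because each row is processed independently, running `ffn` on a single position gives the same row as running it on the whole sequence, which is what "position-wise" means here.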

3.4. Embeddings and Softmax

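Learned embeddings are used to convert the input tokens and output tokens to vectors of dimension $d_{model}$.

In addition, a linear transformation and a softmax function are used to convert the decoder output into predicted next-token probabilities.

The two embedding layers and the pre-softmax linear transformation share the same weight matrix.

(Similar to the approach proposed in the paper Using the output embedding to improve language models)

In the embedding layers, the weights are multiplied by $\sqrt{d_{model}}$.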
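A rough sketch of the weight sharing and the $\sqrt{d_{model}}$ scaling, in plain Python. The matrix `E`, the functions `embed` and `next_token_probs`, and the toy sizes are assumptions for illustration. A real model would feed the decoder output into the final step, so using an embedding row there is only a stand-in.

```python
import math
import random

random.seed(0)
vocab_size, d_model = 6, 4   # toy sizes, for illustration only

# One shared weight matrix E of shape (vocab_size, d_model).
E = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_ids):
    """Look up rows of E and multiply them by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[scale * v for v in E[t]] for t in token_ids]

def next_token_probs(h):
    """Pre-softmax linear transformation with the shared weights (h E^T), then softmax."""
    logits = [sum(h_k * e_k for h_k, e_k in zip(h, row)) for row in E]
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = embed([2, 5])                 # shape (2, d_model)
probs = next_token_probs(x[0])    # x[0] is only a stand-in for a decoder output
print(len(x), len(x[0]), len(probs))   # 2 4 6
```

The same `E` is read in both directions: row lookup for the embedding, and a product with its transpose to get one logit per vocabulary entry.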

3.5. Positional Encoding

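Since the Transformer has no recurrence and no convolution, for the model to make use of the order of the sequence, some information about the relative or absolute position of the tokens in the sequence must be injected.

=> "Positional Encoding" is added to the input embeddings at the bottom of the encoder and decoder stacks.

The positional encodings have the same dimension $d_{model}$ as the embeddings,

which is why the two can be summed (+).

There are various choices of positional encoding, such as learned positional encodings and fixed positional encodings.

(See the paper Convolutional sequence to sequence learning)

This work uses sine and cosine functions of different frequencies.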

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})$

$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})$
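The two formulas can be sketched in plain Python as follows. The function name `positional_encoding`, the toy sizes, and the all-zero `embeddings` placeholder are assumptions for illustration. The loop variable `two_i` plays the role of the even index $2i$.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for two_i in range(0, d_model, 2):            # even dimension index 2i
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)          # PE(pos, 2i)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # PE(pos, 2i+1)
    return pe

max_len, d_model = 4, 8      # toy sizes, for illustration only
pe = positional_encoding(max_len, d_model)

# Hypothetical token embeddings of the same shape (all zeros for simplicity).
embeddings = [[0.0] * d_model for _ in range(max_len)]

# Same dimension d_model, so the two can be added element by element.
x = [[e + p for e, p in zip(e_row, p_row)]
     for e_row, p_row in zip(embeddings, pe)]

print(pe[0])   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

At position 0 every sine slot is 0 and every cosine slot is 1. For later positions, the low dimensions change quickly and the high dimensions change slowly, which is the "different frequencies" part.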