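A Complete Analysis of the Transformer Architecture

This article explains the core components of the Transformer architecture, which reshaped natural language processing, by implementing them in code. It covers the attention mechanism, Multi-Head Attention, Positional Encoding, and more.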

Category: Python

Language: Python

Main tag: #Python

Sub tags:

#Transformer #Attention #DeepLearning #NLP

Introduction

This article takes a detailed look at the structure of the Transformer. It covers ten key concepts, each with an explanation and a PyTorch code example.

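1. Self-Attention Mechanism

Overview

The core of Self-Attention: Query, Key, and Value tensors are used to compute how strongly each position in the input sequence relates to every other position.

Code Example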

import torch
import torch.nn as nn

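def scaled_dot_product_attention(Q, K, V):
    # Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v)
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    # Normalize over the key dimension so each query's weights sum to 1
    attention_weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights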

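Explanation

The dot product of Query and Key measures similarity; the scores are scaled by sqrt(d_k) and passed through softmax to obtain the attention weights. Scaling keeps the dot products from growing with d_k, which would otherwise push softmax into regions with very small gradients. The weights are then used to take a weighted sum of the Values, which is the final output.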
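As a quick sanity check of the function described above, the following sketch runs it on random tensors. The batch size, sequence length, and d_k are arbitrary values assumed for illustration; the function is repeated only so the snippet runs on its own.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Same computation as in the article, repeated so this snippet is self-contained
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    attention_weights = torch.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V), attention_weights

torch.manual_seed(0)
batch, seq_len, d_k = 2, 5, 16          # illustrative sizes (assumed)
x = torch.randn(batch, seq_len, d_k)    # in self-attention, Q, K and V come from the same input

output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape)          # torch.Size([2, 5, 16])
print(weights.shape)         # torch.Size([2, 5, 5])
print(weights.sum(dim=-1))   # every row sums to 1 (up to floating-point error)
```

Each row of `weights` is a probability distribution over the positions that one query attends to.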

2. Multi-Head Attention

Overview

An implementation of Multi-Head Attention, which runs several attention heads in parallel so that information can be captured in different representation subspaces.

Code Example

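class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Each head gets an equal slice of the model dimension
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Projections for queries, keys, and values, plus the output projection
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)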

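Explanation

The d_model dimensions are split evenly across num_heads heads (d_k = d_model // num_heads each), and every head performs attention independently, which lets the model look at the context from several perspectives. The snippet defines only the four projection layers; the forward pass that splits the heads, applies attention, and merges the result through W_o is not shown here.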
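The article's class stops at the constructor, so here is one way the forward pass might look. This is a sketch, not the author's implementation: the class name `MultiHeadAttentionWithForward`, the `_split_heads` helper, and the additive-mask convention (0 keeps a position, -inf blocks it) are all assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttentionWithForward(nn.Module):
    """Hypothetical completion of the article's class; forward() is an assumption."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def _split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        Q = self._split_heads(self.W_q(query))
        K = self._split_heads(self.W_k(key))
        V = self._split_heads(self.W_v(value))
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Assumed additive mask, broadcastable to (batch, num_heads, q_len, k_len)
            scores = scores + mask
        weights = torch.softmax(scores, dim=-1)
        context = torch.matmul(weights, V)  # (batch, num_heads, q_len, d_k)
        batch, _, q_len, _ = context.shape
        # Merge the heads back into one d_model-wide vector per position
        context = context.transpose(1, 2).contiguous().view(batch, q_len, self.d_model)
        return self.W_o(context)

mha = MultiHeadAttentionWithForward(d_model=64, num_heads=8)
x = torch.randn(2, 10, 64)
out = mha(x, x, x)
print(out.shape)  # torch.Size([2, 10, 64])
```

The `(query, key, value, mask)` call signature matches how the Encoder and Decoder layers later in the article invoke attention.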

3. Positional Encoding

Overview

Self-attention has no built-in notion of token order, so position information is encoded with sin/cos functions and added to the input.

Code Example

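import math

def positional_encoding(seq_len, d_model):
    # Returns a (seq_len, d_model) table of sinusoidal position vectors
    PE = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
    # One frequency per sin/cos pair: 1 / 10000^(2i / d_model)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model))
    PE[:, 0::2] = torch.sin(position * div_term)
    # Slice div_term so an odd d_model (one fewer cos column) does not raise a shape error
    PE[:, 1::2] = torch.cos(position * div_term[: d_model // 2])
    return PE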

Explanation

Even indices use sin and odd indices use cos, giving each position a distinct vector. This table is added to the input embeddings to supply position information.
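To make the "added to the input embeddings" step concrete, the sketch below builds the table and adds it to a batch of stand-in embeddings. The sizes and the random `embeddings` tensor are assumptions for illustration, and d_model is assumed to be even.

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    # Same sinusoidal table as in the article (d_model assumed even here)
    PE = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model))
    PE[:, 0::2] = torch.sin(position * div_term)
    PE[:, 1::2] = torch.cos(position * div_term)
    return PE

seq_len, d_model = 6, 8                        # illustrative sizes (assumed)
embeddings = torch.randn(2, seq_len, d_model)  # stand-in for token embeddings
PE = positional_encoding(seq_len, d_model)

# The (seq_len, d_model) table broadcasts over the batch dimension
x = embeddings + PE
print(x.shape)  # torch.Size([2, 6, 8])
print(PE[0])    # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots
```

Because the table depends only on position and dimension, it can be computed once and reused for every batch.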

4. Feed Forward Network

Overview

The Position-wise Feed Forward Network applied after attention; it transforms each position independently.

Code Example

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.dropout(self.relu(self.fc1(x))))

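Explanation

It consists of two linear transformations with a ReLU activation in between; the inner dimension (d_ff) is typically four times d_model (for example, 2048 for d_model = 512). The same transformation is applied to every token.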
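The "position-wise" property can be checked directly: transforming one token on its own should give the same result as transforming it as part of the full sequence. The sketch below assumes small illustrative sizes and switches to eval mode so that dropout does not add randomness.

```python
import torch
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    # Same module as in the article, repeated so this snippet is self-contained
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.dropout(self.relu(self.fc1(x))))

torch.manual_seed(0)
d_model = 16                                   # illustrative size (assumed)
ffn = PositionWiseFeedForward(d_model, d_ff=4 * d_model)
ffn.eval()                                     # disable dropout for a deterministic comparison

x = torch.randn(2, 5, d_model)
full = ffn(x)                                  # transform the whole sequence at once
single = ffn(x[:, 2:3, :])                     # transform only the token at position 2

print(full.shape)                              # torch.Size([2, 5, 16])
# Positions do not interact, so the two results should agree
print(torch.allclose(full[:, 2:3, :], single, atol=1e-5))
```

Mixing information across positions is left entirely to the attention sublayer; the feed-forward network only reshapes each token's representation.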

5. Layer Normalization and Residual Connection

Overview

An implementation of Layer Normalization, for training stability, and the Residual Connection, which eases the flow of information.

Code Example

class SublayerConnection(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

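Explanation

This is the Pre-LN arrangement: normalization is applied first, the result goes through the sublayer, and the output is added to the original input. (The original Transformer paper uses Post-LN, which normalizes after the addition.) The residual path improves gradient flow, which makes deep networks easier to train.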
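The difference between the two orderings is easiest to see side by side. In this sketch the class names `PreLNSublayer` and `PostLNSublayer` are hypothetical, and a plain `nn.Linear` stands in for the attention or feed-forward sublayer.

```python
import torch
import torch.nn as nn

class PreLNSublayer(nn.Module):
    """x + Dropout(Sublayer(LayerNorm(x))), the ordering used in the article's code."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

class PostLNSublayer(nn.Module):
    """LayerNorm(x + Dropout(Sublayer(x))), the ordering of the original Transformer paper."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))

torch.manual_seed(0)
d_model = 32                              # illustrative size (assumed)
x = torch.randn(4, 10, d_model)
sublayer = nn.Linear(d_model, d_model)    # any shape-preserving module works as a stand-in

pre = PreLNSublayer(d_model, dropout=0.0)
post = PostLNSublayer(d_model, dropout=0.0)
out_pre, out_post = pre(x, sublayer), post(x, sublayer)

print(out_pre.shape, out_post.shape)
# Post-LN output is normalized per token (mean close to 0); Pre-LN output is not
print(out_post.mean(dim=-1).abs().max())
```

With Pre-LN the residual stream itself is never normalized inside the block, which is why Pre-LN stacks are often followed by one final LayerNorm.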

6. Encoder Layer

Overview

A complete Encoder Layer that combines Multi-Head Attention with the Feed Forward Network.

Code Example

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout)
        self.sublayer1 = SublayerConnection(d_model, dropout)
        self.sublayer2 = SublayerConnection(d_model, dropout)

    def forward(self, x, mask=None):
        x = self.sublayer1(x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer2(x, self.feed_forward)

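Explanation

Self-Attention and the Feed Forward network are each wrapped in a Residual Connection with Layer Normalization, which helps keep training stable. The call self.self_attn(x, x, x, mask) relies on MultiHeadAttention providing a forward method that accepts query, key, value, and an optional mask.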
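The article does not say what the encoder's `mask` argument contains. A common use is to hide padding tokens, so the sketch below builds such a mask under stated assumptions: the helper name `create_padding_mask`, the padding id 0, and the additive convention (0 keeps a position, -inf blocks it) are all hypothetical choices, not part of the original code.

```python
import torch

def create_padding_mask(token_ids, pad_id=0):
    # Hypothetical helper: -inf at padding positions, 0 elsewhere.
    # Shape (batch, 1, 1, seq_len) so it broadcasts over heads and query positions.
    mask = torch.zeros(token_ids.shape, dtype=torch.float32)
    mask = mask.masked_fill(token_ids == pad_id, float('-inf'))
    return mask[:, None, None, :]

token_ids = torch.tensor([[5, 7, 9, 0, 0],
                          [3, 4, 0, 0, 0]])   # made-up ids; 0 marks padding
mask = create_padding_mask(token_ids)

# Dummy uniform scores with shape (batch, heads, q_len, k_len)
scores = torch.zeros(2, 1, 5, 5)
weights = torch.softmax(scores + mask, dim=-1)

print(mask.shape)        # torch.Size([2, 1, 1, 5])
print(weights[0, 0, 0])  # weight 1/3 on each of the three real tokens, 0 on the two pads
```

Every sequence needs at least one non-padding token; a row that is entirely -inf would produce NaN after softmax.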

7. Masked Multi-Head Attention

Overview

The masked attention used in the Decoder, which prevents a position from attending to future tokens.

Code Example

def create_causal_mask(seq_len):
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    mask = mask.masked_fill(mask == 1, float('-inf'))
    return mask

def masked_attention(Q, K, V, mask=None):
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(Q.size(-1), dtype=torch.float32))
    if mask is not None:
        scores = scores + mask
    return torch.matmul(torch.softmax(scores, dim=-1), V)

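Explanation

The mask is built from an upper-triangular matrix that assigns -inf to every token after the current position. After softmax those positions receive a weight of 0, which prevents information from leaking from the future.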
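A small worked example makes the effect of the mask visible. The sketch uses a sequence length of 4 and uniform dummy scores (both assumed for illustration) so that the resulting weights are easy to read.

```python
import torch

def create_causal_mask(seq_len):
    # Same construction as in the article: -inf strictly above the diagonal, 0 elsewhere
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask.masked_fill(mask == 1, float('-inf'))

mask = create_causal_mask(4)
print(mask)   # row i has 0 for columns 0..i and -inf for columns after i

scores = torch.zeros(4, 4)   # uniform dummy scores
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
# Row i spreads its weight evenly over positions 0..i and gives 0 to later ones:
# [1, 0, 0, 0], [1/2, 1/2, 0, 0], [1/3, 1/3, 1/3, 0], [1/4, 1/4, 1/4, 1/4]
```

Because the diagonal is left unmasked, every row keeps at least one finite score and softmax never sees a row of only -inf.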

8. Decoder Layer

Overview

A Decoder Layer made up of Masked Self-Attention, Encoder-Decoder Attention, and the Feed Forward network.

Code Example

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout)
        self.sublayer = nn.ModuleList([SublayerConnection(d_model, dropout) for _ in range(3)])

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.cross_attn(x, encoder_output, encoder_output, src_mask))
        return self.sublayer[2](x, self.feed_forward)

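Explanation

The first attention learns relationships within the target sequence, and the second, Cross-Attention, learns relationships with the encoder output: its queries come from the decoder, while its keys and values come from the encoder.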
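The shapes involved in Cross-Attention are worth spelling out, since source and target lengths usually differ. This sketch leaves out the learned projections and the multiple heads to keep the focus on shapes; the tensor names and sizes are assumptions for illustration.

```python
import math
import torch

torch.manual_seed(0)
batch, src_len, tgt_len, d_model = 2, 7, 4, 16          # illustrative sizes (assumed)

encoder_output = torch.randn(batch, src_len, d_model)   # stand-in for the encoder stack's output
decoder_state = torch.randn(batch, tgt_len, d_model)    # stand-in for the decoder's activations

# Cross-attention: queries from the decoder, keys and values from the encoder
Q, K, V = decoder_state, encoder_output, encoder_output
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_model)
weights = torch.softmax(scores, dim=-1)
context = torch.matmul(weights, V)

print(weights.shape)  # torch.Size([2, 4, 7])  one row per target position, one column per source position
print(context.shape)  # torch.Size([2, 4, 16]) same length as the target sequence
```

The output always has the target length, which is what lets it be added back to the decoder's residual stream.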

9. Complete Transformer Model

Overview

The overall structure of a complete Transformer model that combines the Encoder and the Decoder.

Code Example

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8, num_layers=6, d_ff=2048):
        super().__init__()
        self.encoder_embed = nn.Embedding(src_vocab, d_model)
        self.decoder_embed = nn.Embedding(tgt_vocab, d_model)
        self.encoder = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])
        self.decoder = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])
        self.fc_out = nn.Linear(d_model, tgt_vocab)

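Explanation

The model consists of input embeddings, positional encoding, a stack of Encoder and Decoder layers, and a final output layer. The snippet defines the embeddings, the two stacks, and the output projection; adding the positional encoding and running the stacks belongs in a forward method, which is not shown. Because every position in a sequence is processed in parallel during training, a Transformer typically trains faster than an RNN on parallel hardware. Two caveats apply: the cost of self-attention grows quadratically with sequence length, and autoregressive decoding at inference time still proceeds one token at a time.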
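To show how the four parts fit together in a forward pass, the sketch below wires embeddings, positional encoding, and an output layer around PyTorch's built-in `nn.Transformer`, which is swapped in here in place of the article's own `EncoderLayer`/`DecoderLayer` stacks. The class name `Seq2SeqSketch`, the `embed` helper, the reduced default sizes, and the sqrt(d_model) embedding scaling (taken from the original paper, not from the article) are all assumptions.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(seq_len, d_model):
    # Sinusoidal table as in Section 3 (d_model assumed even)
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model))
    PE = torch.zeros(seq_len, d_model)
    PE[:, 0::2] = torch.sin(position * div_term)
    PE[:, 1::2] = torch.cos(position * div_term)
    return PE

class Seq2SeqSketch(nn.Module):
    """Hypothetical wrapper: embeddings + positional encoding + nn.Transformer + output layer."""

    def __init__(self, src_vocab, tgt_vocab, d_model=64, num_heads=4, num_layers=2, d_ff=256):
        super().__init__()
        self.d_model = d_model
        self.encoder_embed = nn.Embedding(src_vocab, d_model)
        self.decoder_embed = nn.Embedding(tgt_vocab, d_model)
        # Built-in encoder/decoder stacks stand in for the article's layer lists
        self.core = nn.Transformer(d_model=d_model, nhead=num_heads,
                                   num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                                   dim_feedforward=d_ff, batch_first=True)
        self.fc_out = nn.Linear(d_model, tgt_vocab)

    def embed(self, embedding, tokens):
        # Scale embeddings by sqrt(d_model) (assumption), then add position information
        x = embedding(tokens) * math.sqrt(self.d_model)
        return x + positional_encoding(tokens.size(1), self.d_model).to(x.device)

    def forward(self, src, tgt):
        tgt_len = tgt.size(1)
        # Additive causal mask: -inf above the diagonal, 0 elsewhere
        tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float('-inf')), diagonal=1).to(tgt.device)
        hidden = self.core(self.embed(self.encoder_embed, src),
                           self.embed(self.decoder_embed, tgt), tgt_mask=tgt_mask)
        return self.fc_out(hidden)  # (batch, tgt_len, tgt_vocab) logits

model = Seq2SeqSketch(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (2, 9))    # random token ids as stand-in data
tgt = torch.randint(0, 120, (2, 6))
logits = model(src, tgt)
print(logits.shape)  # torch.Size([2, 6, 120])
```

The output holds one vector of vocabulary logits per target position; a softmax over the last dimension would turn each into next-token probabilities.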

10. Attention Visualization

Overview

Code that visualizes attention weights in order to analyze which tokens the model focuses on.

Code Example

import matplotlib.pyplot as plt

def visualize_attention(attention_weights, src_tokens, tgt_tokens):
    # attention_weights: 2-D tensor of shape (len(tgt_tokens), len(src_tokens))
    plt.figure(figsize=(10, 8))
    plt.imshow(attention_weights.detach().cpu().numpy(), cmap='viridis')
    plt.colorbar()
    plt.xticks(range(len(src_tokens)), src_tokens, rotation=45)
    plt.yticks(range(len(tgt_tokens)), tgt_tokens)
    plt.xlabel('Source')
    plt.ylabel('Target')
    plt.tight_layout()
    plt.show()  # display the figure; without this nothing appears in a plain script

Explanation

Rendering the attention matrix as a heatmap shows which input tokens each output token attends to, which aids interpretation of the model.
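The following usage sketch calls a variant of the function on random weights. The example tokens, the random matrix, the `Agg` backend, the returned figure, and the output file name `attention.png` are all assumptions chosen so the snippet runs without a trained model or a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import torch

def visualize_attention(attention_weights, src_tokens, tgt_tokens):
    # Variant of the article's function that returns the figure instead of only drawing it
    fig = plt.figure(figsize=(10, 8))
    plt.imshow(attention_weights.detach().cpu().numpy(), cmap='viridis')
    plt.colorbar()
    plt.xticks(range(len(src_tokens)), src_tokens, rotation=45)
    plt.yticks(range(len(tgt_tokens)), tgt_tokens)
    plt.xlabel('Source')
    plt.ylabel('Target')
    plt.tight_layout()
    return fig

src_tokens = ["I", "like", "tea", "."]       # made-up example tokens
tgt_tokens = ["Ich", "mag", "Tee", "."]

torch.manual_seed(0)
# Random stand-in for one head's (target x source) attention matrix
weights = torch.softmax(torch.randn(len(tgt_tokens), len(src_tokens)), dim=-1)

fig = visualize_attention(weights, src_tokens, tgt_tokens)
fig.savefig("attention.png")
```

With a multi-head model, the weights have an extra head dimension, so one head (or an average over heads) has to be selected before plotting.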

Conclusion

This article walked through the structure of the Transformer, covering ten concepts with an explanation and a code example for each.


Author: AI Generated
