[CS231n] Assignment 1, Implementation of a vectorized linear SVM

dasu 2023. 1. 4. 19:31

Introduction

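The second part of CS231n Assignment 1 is implementing a linear SVM.

For convenience, the implementation fixes Delta = 1, so please keep that in mind while reading.

The core topics are implementing the forward pass and backpropagation, and each comes in a loop version and a vectorized version.

The loop version is, well... easy enough, so I'll move past it and look at the vectorized version.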

Forward Pass (Using iteration)

First, let's look at the loop version of the code.

The loss is the sum of the margins, and computing each margin requires the score of the correct class.

The code loops over the classes and computes one margin per class.

Put the other way around, as long as $i$ (the training example) is fixed, correct_class_score does not change. It stays constant.

We will need to express this as a matrix product later.
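A minimal sketch of the loop version described above, assuming Delta = 1, `X` of shape (N, D) and `W` of shape (D, C). The name `svm_loss_loop`, the `delta` parameter, and the demo arrays are hypothetical, and the regularization factor `reg * sum(W * W)` is assumed to match the vectorized code later in the post.

```python
import numpy as np

def svm_loss_loop(W, X, y, reg, delta=1.0):
    """Loop-based multiclass SVM loss (illustrative sketch only)."""
    N = X.shape[0]
    C = W.shape[1]
    loss = 0.0
    for i in range(N):
        scores = X[i] @ W                    # class scores for sample i, shape (C,)
        correct_class_score = scores[y[i]]   # stays fixed while i is fixed
        for j in range(C):
            if j == y[i]:
                continue                     # the correct class adds no margin
            margin = scores[j] - correct_class_score + delta
            if margin > 0:
                loss += margin
    loss /= N
    loss += reg * np.sum(W * W)
    return loss

# Tiny hypothetical example: 2 samples, 2 features, 3 classes.
X_demo = np.array([[1.0, 2.0], [0.0, 1.0]])
W_demo = np.array([[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]])
y_demo = np.array([0, 2])
loop_loss = svm_loss_loop(W_demo, X_demo, y_demo, reg=0.0)
```

For this input the per-sample margin sums are 2.9 and 1.1, so the averaged loss is 2.0.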

Once the losses have been summed, the last step is to compute the regularization term.

For this I will reuse the method from the earlier post on batch normalization backpropagation.

Let $f(W) = W^2$, where $f$ is an element-wise operation. Then $reg \times \sum_i \sum_j W_{ij}^2$ can be written as $reg \times \sum_i \sum_j f(W)_{ij}$, and summing every element can be written as $\mathbf{1}^T f(W) \mathbf{1}$.

Since the loss is ultimately returned as a scalar, using the trace trick causes no problems.
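As a quick sanity check of that rewriting, the sketch below compares the direct double sum, the all-ones form, and a trace form on a small hypothetical `W`. The values and variable names are made up for illustration.

```python
import numpy as np

W = np.array([[1.0, -2.0], [0.5, 3.0], [0.0, 1.0]])  # hypothetical weights, shape (D, C) = (3, 2)
reg = 0.1

f_W = W * W                        # element-wise square, f(W)
ones_D = np.ones(W.shape[0])       # all-ones vector of length D
ones_C = np.ones(W.shape[1])       # all-ones vector of length C

reg_sum = reg * np.sum(f_W)                # direct double sum over i and j
reg_ones = reg * (ones_D @ f_W @ ones_C)   # 1^T f(W) 1
reg_trace = reg * np.trace(W.T @ W)        # the same scalar written as a trace
```

All three expressions give the same scalar, which is what makes the trace trick applicable here.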

Forward Pass (Using vectorization)

Now it's time to vectorize and express as much as possible as matrix products!

First, let's handle the part that adds to the loss only when the margin is greater than 0.

If you have studied activation functions, you will notice that this is very similar to passing a value through the ReLU function.

Since $ReLU(x) = \max(0, x)$, we can finish that part by passing the margins through a max operation at the end.

This also means we will need to differentiate ReLU during backpropagation: the gradient is 1 if $x > 0$ and 0 otherwise. In Python this is easy to do with boolean indexing.
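A small sketch of both pieces, the element-wise max and its 0/1 derivative obtained with boolean indexing. The array `x` is a hypothetical stand-in for the margin matrix.

```python
import numpy as np

x = np.array([[-1.5, 0.0, 2.0],
              [0.3, -0.2, 4.0]])    # hypothetical margins

relu_x = np.maximum(0, x)           # element-wise max(0, x)

relu_grad = np.zeros_like(x)        # derivative of max(0, x) with respect to x
relu_grad[x > 0] = 1                # boolean indexing: 1 where x > 0, else 0
```

Note that the entry that is exactly 0 gets gradient 0, matching the "greater than 0" rule in the text.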

The next step is where the real matrix computation happens.

Computing the scores is actually very easy: just multiply the training set $X$ by the weight matrix $W$. There is no separate bias term, so the product is all we need.

Extracting the correct-class scores, however, is not so easy. As mentioned in the iteration section, correct_class_score does not change while $i$ is fixed, so the correct-score matrix depends only on the first index (the row) of $X$. We need a way to implement this!

Let's think of a trick.

We know where the correct answers are. So besides marking the correct answers, we can also assign values only at the correct indices.

For example, suppose the labels are $y = \begin{bmatrix} 1 & 0 & 2 \end{bmatrix}^T$.

Then, instead of using $y$ directly, the idea is to build and use

$$\hat{y} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

There are $N$ training examples and $C$ classes that can be marked for each, so this is a matrix of size $N \times C$. For convenience, I will call this matrix $\hat{y}$.
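Building $\hat{y}$ from the label vector takes one line of integer-array indexing in NumPy. The sketch below uses the example labels from the text and assumes the number of classes `C` is known in advance.

```python
import numpy as np

y = np.array([1, 0, 2])        # labels from the example above
N = y.shape[0]
C = 3                          # number of classes (assumed known)

y_hat = np.zeros((N, C))
y_hat[np.arange(N), y] = 1     # row i gets a 1 in column y[i]
```

Each row of `y_hat` contains exactly one 1, located at that sample's correct class.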

Now, taking the Hadamard product of our score matrix with $\hat{y}$ leaves only the correct-class scores and turns everything else into $0$.

But this is not the form we ultimately want. In the end, each row of the matrix has to be filled entirely with that row's correct-class score.

For example, the current situation is something like

$$\begin{bmatrix} 0 & 3 & 0 \\ -0.7 & 0 & 0 \\ 0 & 0 & 0.4 \end{bmatrix}$$

and the matrix we need is

$$\begin{bmatrix} 3 & 3 & 3 \\ -0.7 & -0.7 & -0.7 \\ 0.4 & 0.4 & 0.4 \end{bmatrix}$$

To build it, think of first collapsing to an $N \times 1$ vector and then expanding back to $N \times C$.

Building the $N \times 1$ vector requires the sum of each row. As used many times before, an all-ones vector (here of length $C$) does this. Placing $C$ of these vectors side by side gives a $C \times C$ all-ones matrix, and multiplying by it yields an $N \times C$ matrix.

In other words, we right-multiply the masked score matrix $XW \odot \hat{y}$ by a $C \times C$ matrix whose entries are all $1$.
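The masking and row-filling steps can be sketched as follows. The score values are hypothetical, chosen so that the correct-class scores are 3, -0.7 and 0.4 as in the example, and the last line shows a plain-indexing alternative only for comparison.

```python
import numpy as np

score = np.array([[1.0, 3.0, 2.0],
                  [-0.7, 0.5, 1.2],
                  [0.1, -0.3, 0.4]])     # hypothetical scores, N = C = 3
y = np.array([1, 0, 2])
N, C = score.shape

y_hat = np.zeros_like(score)
y_hat[np.arange(N), y] = 1

masked = score * y_hat                    # Hadamard product: only correct scores remain
correct_full = masked @ np.ones((C, C))   # every column becomes the row sum

# Comparison: pick the correct scores by indexing and repeat them across columns.
correct_indexed = np.repeat(score[np.arange(N), y][:, np.newaxis], C, axis=1)
```

Because each row of `masked` has a single non-zero entry, its row sum is exactly the correct-class score, so both routes produce the same matrix.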

After that, add $\delta$, set the scores at the correct labels to $0$, and apply the max operation.

The code implementing this is as follows.

Implementation of Forward Pass (Using vectorization)


import numpy as np


def ForwardPass(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation (Delta = 1).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns:
    - loss as single float
    """
    loss = 0.0

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the structured SVM loss, storing the    #
    # result in loss.                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N = X.shape[0]
    C = W.shape[1]
    ones = np.ones((C, C))                    # C x C all-ones matrix
    tp = np.arange(N)                         # row indices 0..N-1
    score = X @ W                             # raw scores, shape (N, C)
    ans = np.zeros_like(score)                # one-hot label matrix (y_hat)
    ans[tp, y] = 1
    score = score - (score * ans) @ ones + 1  # margins with Delta = 1
    score[tp, y] = 0                          # correct class adds no loss
    score[score <= 0] = 0                     # max(0, margin)
    loss += np.sum(score) / N
    loss += reg * np.sum(W * W)

    return loss
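For comparison, the same loss can be computed without the $C \times C$ all-ones matrix by indexing the correct scores directly and letting NumPy broadcasting expand them. This is a swapped-in alternative rather than the method used in this post; the function name `svm_loss_broadcast`, the `delta` parameter, and the demo arrays are hypothetical.

```python
import numpy as np

def svm_loss_broadcast(W, X, y, reg, delta=1.0):
    """Vectorized SVM loss using broadcasting instead of a ones-matrix product."""
    N = X.shape[0]
    rows = np.arange(N)
    score = X @ W                                  # shape (N, C)
    correct = score[rows, y][:, np.newaxis]        # shape (N, 1), broadcast over columns
    margins = np.maximum(0, score - correct + delta)
    margins[rows, y] = 0                           # correct class adds no loss
    return np.sum(margins) / N + reg * np.sum(W * W)

# Hypothetical data: sample 0 has no active margins, sample 1 has two.
X_demo = np.array([[1.0, 2.0], [0.0, 1.0]])
W_demo = np.array([[0.1, 0.2, 0.3], [0.0, -0.1, 0.9]])
y_demo = np.array([2, 0])
demo_loss = svm_loss_broadcast(W_demo, X_demo, y_demo, reg=0.1)
```

Here the data loss is (0.9 + 1.9) / 2 = 1.4 and the regularization term is 0.1 × 0.96 = 0.096, giving 1.496 in total. The broadcasting form avoids the extra matrix product, while the ones-matrix form keeps everything as matrix products, which is what the derivation below relies on.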

Backpropagation (Using vectorization)

Now that the forward pass is neatly finished, backpropagation turns out to be easier than you might expect!

Writing out how the loss is computed, we have

$$\mathcal{L} = \frac{1}{N} \sum_i \sum_j \max(0, score)_{ij} + \lambda \|W\|_2^2$$

where $\lambda$ is `reg` in the code. The $score$ matrix is $XW - (XW \odot \hat{y}) \cdot \mathbf{1}_{(C, C)} + 1$ and has size $N \times C$.

As mentioned earlier, $\sum_i \sum_j \max(0, score)_{ij}$ can be written as $\mathbf{1}^T \max(0, score) \mathbf{1}$, and after moving things around with the trace trick, all we need is an $N \times C$ matrix whose entries are all $1/N$.
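To make the role of that $1/N$ matrix concrete, the sketch below checks numerically that the averaged sum equals both the all-ones form and a trace (inner product) with an $N \times C$ matrix filled with $1/N$. The matrix `M` is a hypothetical stand-in for $\max(0, score)$.

```python
import numpy as np

M = np.array([[0.0, 2.0, 0.5],
              [1.0, 0.0, 0.0]])          # hypothetical max(0, score), shape (N, C)
N, C = M.shape

mean_direct = np.sum(M) / N                       # (1/N) * sum_i sum_j M_ij
mean_ones = (np.ones(N) @ M @ np.ones(C)) / N     # (1/N) * 1^T M 1

A = np.full((N, C), 1.0 / N)                      # N x C matrix with all entries 1/N
mean_trace = np.trace(A.T @ M)                    # <A, M> = tr(A^T M)
```

All three values agree, so the constant matrix `A` is the gradient of the data loss with respect to $\max(0, score)$.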

Next we have to deal with $\max$. The max operation is applied element-wise, so by the matrix differentiation rule from before, $d\sigma(X) = \sigma'(X) \odot dX$ holds. The gradient (derivative) of $\max$ is $1$ where the input is greater than $0$ and $0$ otherwise, so we need a matrix that holds those input values.

Fortunately, that matrix is the score matrix. That is, we look at the score matrix and set entries greater than $0$ to $1$ and all others to $0$. This is where boolean indexing comes in.

I will call this matrix, scaled by the $1/N$ factor from above, $grad_{max}$.

Expressed with the trace trick, the current situation is

$$d\mathcal{L} = \langle grad_{max}, d(score) \rangle$$
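A small numerical illustration of this relation, assuming `grad_max` carries the $1/N$ factor as in the code later in the post. The score values and the perturbation `d_score` are hypothetical, chosen so that no entry changes sign.

```python
import numpy as np

score = np.array([[-1.0, 2.0, -0.5],
                  [1.0, -0.2, 0.3]])     # hypothetical pre-max values, N = 2
N = score.shape[0]

grad_max = np.zeros_like(score)
grad_max[score > 0] = 1                  # derivative of max(0, .)
grad_max /= N                            # fold in the 1/N factor

def data_loss(s):
    """Averaged hinge sum: (1/N) * sum of max(0, s)."""
    return np.sum(np.maximum(0, s)) / N

d_score = 1e-3 * np.array([[1.0, -2.0, 3.0],
                           [0.5, 1.0, -1.0]])   # small perturbation

actual_change = data_loss(score + d_score) - data_loss(score)
predicted_change = np.sum(grad_max * d_score)   # <grad_max, d(score)>
```

Since the loss is piecewise linear, the predicted and actual changes coincide as long as the perturbation does not push any entry across 0.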

OK, having come this far, we are almost done.

When computing the total differential of score, constants do not matter and can be dropped, and since the two remaining terms are joined by subtraction, let's handle each separately.

Start with $\langle grad_{max}, d(XW) \rangle$. This is very easy to compute. $X$ does not depend on $W$, so $d(XW) = X\,dW$. Therefore $\langle grad_{max}, d(XW) \rangle = \langle X^T \cdot grad_{max}, dW \rangle$.
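The identity $\langle G, X\,dW \rangle = \langle X^T G, dW \rangle$ used in this step can be checked numerically. In the sketch below `G` stands in for $grad_{max}$; all matrices are random and hypothetical, and the inner product is the sum of element-wise products.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 4, 5, 3
X = rng.standard_normal((N, D))
G = rng.standard_normal((N, C))      # stands in for grad_max
dW = rng.standard_normal((D, C))     # an arbitrary perturbation of W

lhs = np.sum(G * (X @ dW))           # <G, X dW>
rhs = np.sum((X.T @ G) * dW)         # <X^T G, dW>
```

Both sides equal $\mathrm{tr}(G^T X\,dW)$, which is why `X.T @ grad_max` can be read off as the gradient of this term.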

Now for the second term. In $\langle grad_{max}, d\{(XW \odot \hat{y}) \cdot \mathbf{1}_{(C, C)}\} \rangle$, first move the trailing $C \times C$ all-ones matrix $\mathbf{1}$ over so that it multiplies $grad_{max}$, then move the element-wise operation to the left. Finally, repeat the same step as before.

This gives

$$\begin{aligned} \langle grad_{max}, d\{(XW \odot \hat{y}) \cdot \mathbf{1}_{(C, C)}\} \rangle &= \langle grad_{max} \cdot \mathbf{1}_{(C, C)}, d(XW \odot \hat{y}) \rangle \\ &= \langle (grad_{max} \cdot \mathbf{1}_{(C, C)}) \odot \hat{y}, d(XW) \rangle \\ &= \langle X^T \cdot ((grad_{max} \cdot \mathbf{1}_{(C, C)}) \odot \hat{y}), dW \rangle \end{aligned}$$
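The chain of equalities for the second term can be checked the same way. The sketch below compares the first and last expressions on random, hypothetical matrices; `G` again stands in for $grad_{max}$ and `y` is an arbitrary label vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 4, 5, 3
X = rng.standard_normal((N, D))
G = rng.standard_normal((N, C))          # stands in for grad_max
dW = rng.standard_normal((D, C))         # an arbitrary perturbation of W
y = np.array([0, 2, 1, 2])               # hypothetical labels

y_hat = np.zeros((N, C))
y_hat[np.arange(N), y] = 1
ones_CC = np.ones((C, C))

# <G, d{(XW ⊙ y_hat) 1}> with d(XW) = X dW
lhs = np.sum(G * (((X @ dW) * y_hat) @ ones_CC))
# <X^T ((G 1) ⊙ y_hat), dW>
rhs = np.sum((X.T @ ((G @ ones_CC) * y_hat)) * dW)
```

The first step works because the all-ones matrix is symmetric, and the second because an element-wise mask can be moved to either side of the inner product.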

Finally, consider the regularization term. It has the same $\mathbf{1}^T f(W) \mathbf{1}$ form, so we build a $D \times C$ matrix (the shape of $W$) whose entries are all $reg$, and since $f'(W) = 2W$, we take the Hadamard product with it and we are done.

That is, the gradient of the regularization term is $2 \times reg \odot W$, where $reg$ is a $D \times C$ matrix.
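A brief check of this gradient against a central finite difference on a single entry. The weights, the value of `reg`, and the helper `reg_loss` are hypothetical.

```python
import numpy as np

W = np.array([[1.0, -2.0], [0.5, 3.0], [0.0, 1.0]])   # hypothetical weights, shape (D, C)
reg = 0.25

reg_matrix = np.full(W.shape, reg)       # D x C matrix with all entries reg
reg_grad = 2 * reg_matrix * W            # 2 * reg ⊙ W

def reg_loss(W_):
    """Regularization term reg * sum of squared weights."""
    return reg * np.sum(W_ * W_)

h = 1e-5
step = np.zeros_like(W)
step[1, 1] = h                           # perturb a single entry of W
numeric = (reg_loss(W + step) - reg_loss(W - step)) / (2 * h)
```

In practice the constant matrix is unnecessary because a scalar `reg` broadcasts, which is why the code simply uses `2 * reg * W`.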

Expressed in code, the above looks like this.


import numpy as np


def Backprop(W, X, y, reg):
    """
    Gradient of the structured SVM loss, vectorized implementation (Delta = 1).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns:
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the gradient for the structured SVM     #
    # loss, storing the result in dW.                                           #
    #                                                                           #
    # Hint: Instead of computing the gradient from scratch, it may be easier    #
    # to reuse some of the intermediate values that you used to compute the     #
    # loss.                                                                     #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Recompute the forward-pass intermediates that the gradient reuses.
    N = X.shape[0]
    C = W.shape[1]
    ones = np.ones((C, C))
    tp = np.arange(N)
    score = X @ W
    ans = np.zeros_like(score)
    ans[tp, y] = 1
    score = score - (score * ans) @ ones + 1
    score[tp, y] = 0
    score[score <= 0] = 0

    score_grad = np.copy(score)
    score_grad[score_grad > 0] = 1  # derivative of max: 1 where margin > 0
    grad_max = score_grad / N       # fold in the 1/N factor
    dW = X.T @ grad_max - X.T @ ((grad_max @ ones) * ans)
    dW += 2 * reg * W

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return dW
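One way to gain confidence in the derivation is a numerical gradient check. The sketch below re-implements the loss and the gradient formula in compact form and compares the analytic gradient with central differences. The function names `svm_loss`, `svm_grad`, `numerical_grad` and the demo data are hypothetical, and the data is chosen so that no margin sits near 0.

```python
import numpy as np

def svm_loss(W, X, y, reg):
    """Vectorized SVM loss with Delta = 1."""
    N = X.shape[0]
    rows = np.arange(N)
    score = X @ W
    margins = np.maximum(0, score - score[rows, y][:, np.newaxis] + 1.0)
    margins[rows, y] = 0
    return np.sum(margins) / N + reg * np.sum(W * W)

def svm_grad(W, X, y, reg):
    """Gradient following the formula derived in this post."""
    N, C = X.shape[0], W.shape[1]
    rows = np.arange(N)
    ones = np.ones((C, C))
    score = X @ W
    y_hat = np.zeros_like(score)
    y_hat[rows, y] = 1
    margins = score - (score * y_hat) @ ones + 1.0
    margins[rows, y] = 0
    grad_max = (margins > 0).astype(float) / N
    dW = X.T @ grad_max - X.T @ ((grad_max @ ones) * y_hat)
    return dW + 2 * reg * W

def numerical_grad(f, W, h=1e-5):
    """Central-difference gradient of a scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = h
        grad[idx] = (f(W + step) - f(W - step)) / (2 * h)
    return grad

X_demo = np.array([[1.0, 2.0], [0.0, 1.0]])
W_demo = np.array([[0.1, 0.2, 0.3], [0.0, -0.1, 0.9]])
y_demo = np.array([2, 0])

analytic = svm_grad(W_demo, X_demo, y_demo, 0.1)
numeric = numerical_grad(lambda W_: svm_loss(W_, X_demo, y_demo, 0.1), W_demo)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

If the derivation is right, `max_abs_diff` should be tiny; a large value usually points to a sign or masking mistake.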

Outroduction

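Honestly, I used to think learning linear algebra was pointless, but knowing just the trace trick makes backprop calculations really convenient.

Expressing things as matrix products is nice, too.

I'll post some linear algebra material when I get the chance.

I took linear algebra this semester and was too lazy to post about it, but I'll write up what I learned.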

Attribution-NonCommercial-NoDerivatives
