Linear Algebra(3) - Dimension Reduction

2021. 1. 30. 10:23ㆍ[AI]/Data Science Fundamentals

<Learned Stuff>

[Dimension Reduction]

Key points

eigenvector / eigenvalue
dimension reduction

Concepts

One-Hot Encoding
PCA(Principal Component Analysis)

<New Stuff>

[Eigenvector / Eigenvalue]

$A\cdot\vec{x}=\lambda\cdot\vec{x}$

$\vec{x}$ : eigenvector

$\lambda$ : eigenvalue (scalar)

Calculation (assuming $A$ is a $2 \times 2$ matrix):

$
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\cdot
\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}
=
\lambda
\cdot
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\cdot
\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}
$

==>
$
\begin{bmatrix} a_{11}-\lambda & a_{12} \\ a_{21} & a_{22}-\lambda \end{bmatrix}
\cdot
\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \end{bmatrix}
$

Independent columns : $Ax = 0$ has only the solution $x = 0$, $A$ is invertible
Dependent columns : $Ax = 0$ has many solutions, $A$ is not invertible

For a nonzero $\vec{x}$ to exist, $(A-\lambda I)$ must be singular : $\det(A-\lambda I)=0$

==> Substituting each eigenvalue back into $(A-\lambda I)\vec{x}=0$ gives the corresponding eigenvector.
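As a worked illustration of this calculation, the sketch below uses a hypothetical matrix `A = [[2, 1], [1, 2]]` (not from the original notes). For a $2 \times 2$ matrix, $\det(A-\lambda I)=0$ expands to $\lambda^2 - tr(A)\lambda + \det(A) = 0$, so the eigenvalues are the roots of that quadratic. The helper `eigenvector_2x2` is an illustrative name, and it only handles the $2 \times 2$ case.

```python
import numpy as np

# Hypothetical 2x2 example matrix (an assumption for illustration)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# For a 2x2 matrix: det(A - lambda*I) = lambda^2 - trace(A)*lambda + det(A)
trace = A[0, 0] + A[1, 1]
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]

# The roots of the characteristic polynomial are the eigenvalues
eigenvalues = np.roots([1.0, -trace, det])


def eigenvector_2x2(A, lam):
    """Return a unit vector x with (A - lam*I) x = 0 (2x2 case only)."""
    M = A - lam * np.eye(2)
    for row in M:
        if np.linalg.norm(row) > 1e-12:
            # A row (a, b) requires a*x1 + b*x2 = 0, which x = (b, -a) satisfies
            x = np.array([row[1], -row[0]])
            return x / np.linalg.norm(x)
    # M is the zero matrix: every nonzero vector is an eigenvector
    return np.array([1.0, 0.0])


for lam in eigenvalues:
    x = eigenvector_2x2(A, lam)
    # The last value printed is True when A @ x equals lam * x
    print(lam, x, np.allclose(A @ x, lam * x))
```

Because $(A-\lambda I)$ is singular at each eigenvalue, its two rows are multiples of each other, which is why a single nonzero row is enough to pin down the direction of the eigenvector.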

Code

import numpy as np

# Assume a is a square np.array

eig = np.linalg.eig(a)

eig[0]
# ==> returns the eigenvalues

eig[1]
# ==> returns the eigenvectors (one per column)
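One detail that is easy to miss with `np.linalg.eig` is that the eigenvectors come back as the columns of the second array, so column `i` pairs with eigenvalue `i`. A minimal check, assuming a hypothetical matrix `a` chosen only for illustration:

```python
import numpy as np

# Hypothetical example matrix (an assumption, not from the original notes)
a = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(a)

# Column i of `eigenvectors` belongs to eigenvalues[i]
residuals = []
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    # a @ v should equal eigenvalues[i] * v up to floating-point error
    residuals.append(np.linalg.norm(a @ v - eigenvalues[i] * v))

print(eigenvalues)
print(residuals)
```

Reading rows instead of columns would generally give vectors that do not satisfy $A\cdot\vec{x}=\lambda\cdot\vec{x}$, so the residual check is a cheap safeguard.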

[Dimension Reduction]

The more features a dataset has, the more prone a model is to overfitting.
To address this, the dimension has to be reduced (e.g. PCA).

[One-Hot Encoding]

The process of converting categorical data into numerical data

Code

Categorical data with the values $A, B, C$ is converted to $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

pd.get_dummies(df['feature_1'])
# ==> returns the categorical data in the 'feature_1' column as numerical indicator columns (True : 1 / False : 0)
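A small runnable sketch of the same call, assuming a hypothetical DataFrame with a single `feature_1` column. The `dtype=int` argument is added here so the result is 0/1 rather than booleans, which is what recent pandas versions return by default.

```python
import pandas as pd

# Hypothetical DataFrame; only the column name 'feature_1' comes from the notes
df = pd.DataFrame({"feature_1": ["A", "B", "C", "A"]})

# dtype=int forces 0/1 output instead of True/False
one_hot = pd.get_dummies(df["feature_1"], dtype=int)

print(one_hot)
```

Each distinct category becomes its own column, and every row has exactly one 1, in the column of its category.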

[PCA (Principal Component Analysis)]

The goal is to reduce the dimension $N$ of the data (the number of features) based on the variance of the features.
With only $n$ ($n \le N$) principal components, a large part of the data can still be analyzed.

How

Standardize the data.
Draw a line PC1 through the origin and project each data point perpendicularly onto it. (red part)
For each data point, measure the distance from the origin to its projection point. (green part)
Call that distance $h$ and find the line for which $H$ is largest. The direction of that line is the eigenvector, and $\frac{H}{n_{samples} - 1}$ (i.e., the variance along it) is the eigenvalue.

($H = h_1^2 + h_2^2 + h_3^2 + \dotsb + h_{n_{samples}}^2$)

Draw a line PC2 through the origin, perpendicular to PC1, and repeat the process above.
If there are more PCs, keep repeating the process.
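The procedure above can be sketched with plain NumPy. The data here is synthetic and the variable names are illustrative assumptions. Rather than searching over lines directly, the sketch takes the eigenvectors of the covariance matrix and then checks, by brute force over a grid of angles, that no other direction gives a larger $H$ than PC1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 2 correlated features (an assumption)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Standardize each feature (zero mean, unit variance)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the covariance matrix of the standardized data
cov = np.cov(Z, rowvar=False)  # divides by n_samples - 1
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh is for symmetric matrices

# Sort from largest to smallest eigenvalue so that PC1 comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
pc1, pc2 = eigenvectors[:, 0], eigenvectors[:, 1]

# h: signed distance from the origin to each point's projection onto PC1
h = Z @ pc1

# H: sum of squared distances; H / (n_samples - 1) is the variance along PC1
H = np.sum(h ** 2)
variance_along_pc1 = H / (len(Z) - 1)


def H_for_angle(theta):
    """Sum of squared projection lengths onto the unit vector at angle theta."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    return np.sum((Z @ direction) ** 2)


# Brute-force check: the best H over a fine grid of directions
best_H_on_grid = max(H_for_angle(t) for t in np.linspace(0, np.pi, 1801))

print(variance_along_pc1, eigenvalues[0])
print(H, best_H_on_grid)
print(pc1 @ pc2)  # close to 0: PC2 is perpendicular to PC1
```

The first printed pair should agree, which is the statement that $\frac{H}{n_{samples} - 1}$ along PC1 is the largest eigenvalue.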

Code (Easy Version)

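from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Assume df is a DataFrame of numeric features

# Standardize the data (zero mean, unit variance per feature)
z = StandardScaler().fit_transform(df)

# Create n principal components (n is an integer, n <= number of features)
pca = PCA(n_components=n)

# Returns the projected data (as a NumPy array, not a DataFrame)
pca.fit_transform(z)

# Returns the eigenvectors, one per row (the directions that maximize H)
pca.components_

# Returns the eigenvalues (the variance along each component)
pca.explained_variance_

# Variance of each PC / total variance
pca.explained_variance_ratio_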
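
For a version that runs end to end, the sketch below applies the same calls to a synthetic DataFrame. The column names `f1`, `f2`, `f3` and the data itself are assumptions made for illustration. It also compares `explained_variance_` with the eigenvalues of the covariance matrix of the standardized data, and prints the cumulative variance ratio, which is one common way to judge how many components are worth keeping.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical DataFrame with 3 features, two of them strongly correlated
base = rng.normal(size=100)
df = pd.DataFrame({
    "f1": base + 0.1 * rng.normal(size=100),
    "f2": 2.0 * base + 0.1 * rng.normal(size=100),
    "f3": rng.normal(size=100),
})

z = StandardScaler().fit_transform(df)

pca = PCA(n_components=2)
projected = pca.fit_transform(z)  # ndarray with one column per component

# Eigenvalues of the covariance matrix of z, largest first
cov_eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(z, rowvar=False)))[::-1]

print(projected.shape)
print(pca.explained_variance_, cov_eigenvalues[:2])
print(pca.explained_variance_ratio_.cumsum())  # cumulative share of variance kept
```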
