Learned Stuff

Key Points

Ensemble Model

Random Forest
- Bagging

Ordinal Encoder

New Stuff

[Ensemble Model]

A modeling approach that trains a separate model on each of several data sets and uses them together to get better performance than a single model.

Diagram

[Random Forest]

A Random Forest can be described as an ensemble model of decision tree models.
It is built using the Bagging method.

Diagram

Bagging

Bootstrapping + Aggregating
- Bootstrapping : sampling the data to build a subset
- Aggregating : combining the decision trees built from each subset into a single model
  - regression model : the prediction is the mean of the outputs of the individual decision trees
  - classification model : the prediction is the majority vote among the outputs of the individual decision trees
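
The two aggregation rules above can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn implements it internally; the function names and the per-tree outputs are made up for the example.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(tree_outputs):
    """Average the numeric predictions of the individual trees."""
    return mean(tree_outputs)


def aggregate_classification(tree_outputs):
    """Return the label predicted by the most trees (majority vote)."""
    return Counter(tree_outputs).most_common(1)[0][0]


# Hypothetical outputs of five trees for one input row
regression_outputs = [2.0, 3.0, 4.0, 3.0, 3.0]
classification_outputs = ["cat", "dog", "cat", "cat", "dog"]

print(aggregate_regression(regression_outputs))          # 3.0
print(aggregate_classification(classification_outputs))  # cat
```

In this sketch a tie is resolved in favor of the label that appears first in the list; a real implementation may break ties differently.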

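Bootstrapping Process

Sample randomly from the Original Data.
- A data point that has already been drawn can be drawn again (sampling with replacement).

Data that are not drawn in the sampling step are called Out-of-Bag (OOB) samples.
- They are used later in the validation step.

A Decision Tree Model is built on each Bootstrapped Data set and validated with its OOB samples.
- This is where Aggregation takes place. (The methods are described above.)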
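
The sampling step can be sketched as follows, assuming the rows are identified by their index. The function name `bootstrap_with_oob` and the seed are hypothetical choices for this illustration; it uses only the standard library and is not the scikit-learn implementation.

```python
import random


def bootstrap_with_oob(n_rows, rng):
    """Draw n_rows indices with replacement.

    Returns (bootstrap indices, out-of-bag indices).
    """
    drawn = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows that were never drawn form the OOB sample for this tree
    oob = sorted(set(range(n_rows)) - set(drawn))
    return drawn, oob


rng = random.Random(0)  # fixed seed so the draw is repeatable (arbitrary value)
drawn, oob = bootstrap_with_oob(10, rng)

print("bootstrap:", drawn)  # some indices appear more than once
print("OOB      :", oob)    # indices that were never drawn
```

Because the draw is with replacement, some rows repeat and others are left out. For large data sets roughly 37% of the rows end up out-of-bag for any given tree.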

Code

from category_encoders import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer 
from sklearn.pipeline import make_pipeline

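# Assumes X_train / y_train / X_validation / y_validation already exist
# The original left the hyperparameter values blank; the values below are
# placeholders (scikit-learn defaults, plus an arbitrary seed) so the code runs.

pipe = make_pipeline(
    OneHotEncoder(), # encode the categorical features
    SimpleImputer(), # fill NaN values (default : mean of each column)
    RandomForestClassifier(
        n_jobs=-1,
        random_state=42,      # arbitrary example seed
        oob_score=True,
        n_estimators=100,     # scikit-learn default
        criterion='gini',     # scikit-learn default
        max_depth=None,       # scikit-learn default (no depth limit)
        min_samples_split=2,  # scikit-learn default
        min_samples_leaf=1,   # scikit-learn default
        max_features='sqrt'   # default in recent scikit-learn versions
        )
)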
# n_jobs : -1 means all processors are used (the model trains faster)

# random_state : int / setting a seed makes the sample bootstrapping reproducible

# oob_score : True means the OOB samples are used for validation

# n_estimators : number of trees

# criterion : 'gini' or 'entropy'

# max_depth : maximum depth of each tree

# min_samples_split : minimum number of samples a node must have before it can be split

# min_samples_leaf : minimum number of samples required at a leaf node

# max_features : 'sqrt' or 'log2' or None or int or float. Maximum number of features considered at each split
    # older scikit-learn versions also accepted 'auto' (the former default); it has since been removed
    # ex) with 'sqrt', each split considers the square root of the total number of features


pipe.fit(X_train, y_train) # fit on the train data

# Methods and attributes
pipe.score(X_validation, y_validation) # returns the accuracy on the validation data

pipe.named_steps['randomforestclassifier'].oob_score_ # returns the accuracy on the OOB samples

pipe.predict(X_test) # returns the predictions for the test data (assumes X_test exists)

[Ordinal Encoder]

It can be used for Categorical Data that has an order.
- Ex) dog/cat has no order
- Ex) movie ratings of 1/2/3 points have an order

Unlike the One-Hot Encoder, it does not increase the number of features; the encoding is done within the column itself.
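
The difference in output shape can be seen with a small pure-Python sketch. The helper functions and the rating labels here are hypothetical and only illustrate the idea; they are not the category_encoders implementation.

```python
def ordinal_encode(values, order):
    """Map each category to its 1-based position in `order` (one column stays one column)."""
    mapping = {category: i for i, category in enumerate(order, start=1)}
    return [mapping[v] for v in values]


def one_hot_encode(values, categories):
    """Create one 0/1 column per category (the number of columns grows)."""
    return [[1 if v == c else 0 for c in categories] for v in values]


ratings = ["2 points", "1 point", "3 points", "1 point"]
order = ["1 point", "2 points", "3 points"]  # explicit order, lowest to highest

print(ordinal_encode(ratings, order))  # [2, 1, 3, 1]
print(one_hot_encode(ratings, order))  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

The ordinal result is a single column of integers that preserves the order, while the one-hot result needs one column per category.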

Code

To use it in Colab, install the package with pip first.

pip install category_encoders

from category_encoders import OrdinalEncoder

# Assumes X_train / X_validation already exist

# Store OrdinalEncoder() in ord_enc
ord_enc = OrdinalEncoder()

ord_enc.fit_transform(X_train) # fit on the train data and transform it

ord_enc.transform(X_validation) # transform the validation data
