Learned Stuff

Key Points

Pipeline
Decision Tree
Feature Importance

New Stuff

[Pipeline]

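Think of a pipeline as a container that holds one model together with its preprocessing steps.
- It makes the code more readable.

The steps run sequentially, in the order they are written in the code.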

Code

from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler  # used in the pipeline below

# Assume X_train / y_train / X_validation / y_validation / X_test already exist

pipe = make_pipeline(
    OneHotEncoder(),
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression()
)  # each step can be tuned through its hyperparameters
pipe.fit(X_train, y_train)  # fit the preprocessing steps and the model, in order

pipe.score(X_validation, y_validation)  # returns accuracy
pipe.predict(X_test)  # returns predictions for the test data
pipe.named_steps  # dict-like mapping of step name -> step object

# e.g.
enc = pipe.named_steps['onehotencoder']  # pull the fitted OneHotEncoder out of the pipeline
enc.transform(X_validation)  # one-hot encode the validation data with the already-fitted encoder
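
To make "the steps run in order" concrete, here is a minimal sketch that replays a fitted pipeline by hand. It is only an illustration under assumptions: the toy array and the names `X_toy`, `y_toy`, and `toy_pipe` are made up, and the one-hot encoding step is left out so the example needs nothing beyond scikit-learn and NumPy.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two numeric features, one missing value
X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0],
                  [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
toy_pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
step_names = list(toy_pipe.named_steps)
print(step_names)

# Replaying the fitted steps by hand, in order, gives the same predictions
imputed = toy_pipe.named_steps['simpleimputer'].transform(X_toy)
scaled = toy_pipe.named_steps['standardscaler'].transform(imputed)
manual_pred = toy_pipe.named_steps['logisticregression'].predict(scaled)
pipe_pred = toy_pipe.predict(X_toy)
print((manual_pred == pipe_pred).all())
```

Calling `predict` on the pipeline is therefore shorthand for calling `transform` on every preprocessing step and then `predict` on the final model.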

[Decision Tree]

A decision tree repeatedly splits the data with true/false questions.
Its nodes are classified as Root Node / Decision Node / Leaf Node.
Impurity (the higher the impurity, the more evenly the classes are mixed)
- (example for the two-class case)
- Gini Impurity
  - $Gini = 1-P(YES)^2-P(NO)^2$
  - $\text{Gini Impurity (weighted impurity)} = P_{split1}\times Gini_{split1} + P_{split2}\times Gini_{split2}$
- Entropy:
  - $Entropy=-P(YES)\times\log_2 P(YES)-P(NO)\times\log_2 P(NO)$
  - $\text{Weighted Entropy} = P_{split1}\times Entropy_{split1} + P_{split2}\times Entropy_{split2}$
- Here $P_{split}$ is the fraction of the node's samples that falls into that split.
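
The formulas above can be checked with a few lines of plain Python. This is a sketch, not a library implementation: the helper names `gini`, `entropy`, and `weighted` and the sample counts are made up for illustration, and the entropy is assumed to use a base-2 logarithm.

```python
import math

def gini(p_yes):
    """Gini of a two-class node, given the fraction of YES samples."""
    p_no = 1.0 - p_yes
    return 1.0 - p_yes ** 2 - p_no ** 2

def entropy(p_yes):
    """Entropy (log base 2) of a two-class node; 0 * log(0) is treated as 0."""
    total = 0.0
    for p in (p_yes, 1.0 - p_yes):
        if p > 0:
            total -= p * math.log2(p)
    return total

def weighted(impurity, splits):
    """Weighted impurity of a split; splits is a list of (n_yes, n_no) counts."""
    n_total = sum(n_yes + n_no for n_yes, n_no in splits)
    return sum((n_yes + n_no) / n_total * impurity(n_yes / (n_yes + n_no))
               for n_yes, n_no in splits)

# Made-up split: left node has 4 YES / 1 NO, right node has 1 YES / 4 NO
splits = [(4, 1), (1, 4)]
print(gini(0.5))               # 0.5 (the most mixed two-class node)
print(entropy(0.5))            # 1.0
print(weighted(gini, splits))  # about 0.32
```

A pure node (all YES or all NO) gives 0 for both measures, which is why lower values mean a cleaner split.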

How

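1. Among all features, choose the one whose split gives the lowest Gini Impurity (or Weighted Entropy) with respect to the target variable, and make it the Root Node.

2. Compute the Gini (or Entropy) of each node produced by the split.

3. For each remaining feature, compute the Gini Impurity (or Weighted Entropy) of splitting that node on it.

4. Compare the node's own Gini with the lowest Gini Impurity found in step 3.
If Gini > Gini Impurity: make the feature with the lowest Gini Impurity the Decision Node.
If Gini < Gini Impurity: the node cannot usefully be split any further, so it becomes a Leaf Node.

5. Repeat steps 2-4 until no node can be split.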
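
The selection rule in these steps can be sketched in plain Python. This is a simplified illustration, not how scikit-learn is implemented: it assumes every feature is a True/False value (real trees also search thresholds on numeric features), and the toy rows, the feature names `student` and `owns_car`, and the helper `best_split` are hypothetical.

```python
def gini(labels):
    """Gini of a node from its list of 0/1 labels (0.0 for an empty node)."""
    if not labels:
        return 0.0
    p_yes = sum(labels) / len(labels)
    return 1.0 - p_yes ** 2 - (1.0 - p_yes) ** 2

def weighted_gini(rows, labels, feature):
    """Weighted Gini after splitting the node on a True/False feature."""
    left = [y for row, y in zip(rows, labels) if row[feature]]
    right = [y for row, y in zip(rows, labels) if not row[feature]]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows, labels, features):
    """Return (feature, weighted Gini) of the best split, or None to make a leaf."""
    if not features:
        return None
    scores = {f: weighted_gini(rows, labels, f) for f in features}
    best = min(scores, key=scores.get)
    # Split only if it lowers impurity below the node's own Gini
    if scores[best] < gini(labels):
        return best, scores[best]
    return None

# Hypothetical toy data: does a person buy (1) or not (0)?
rows = [
    {"student": True,  "owns_car": True},
    {"student": True,  "owns_car": False},
    {"student": True,  "owns_car": True},
    {"student": False, "owns_car": False},
    {"student": False, "owns_car": True},
    {"student": False, "owns_car": False},
]
labels = [1, 1, 1, 0, 0, 0]

root = best_split(rows, labels, ["student", "owns_car"])
print(root)  # ('student', 0.0): this split separates the classes perfectly
```

Applying `best_split` again to each child node, with the remaining features, is the repetition described in the last step; a return value of `None` marks a Leaf Node.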

Code

from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Assume DataFrames named X_train / y_train / X_validation / y_validation already exist

pipe = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    DecisionTreeClassifier(
        random_state=42,       # fixes the randomness (42 is an arbitrary example value)
        criterion='gini',      # 'gini' or 'entropy' (default='gini')
        min_samples_split=2,   # minimum number of samples a node needs to be split (default=2)
        min_samples_leaf=1,    # minimum number of samples required in a leaf node (default=1)
        max_depth=None         # maximum depth of the tree (default=None)
    )
)
# Apart from random_state, the values above are the defaults; replace them with your own settings.


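pipe.fit(X_train, y_train)  # fit the model

pipe.score(X_validation, y_validation)  # returns the accuracy score

# methods of the fitted tree
tree = pipe.named_steps['decisiontreeclassifier']
tree.get_depth()  # returns the depth of the tree
tree.get_n_leaves()  # returns the number of leaf nodes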
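
To see how the size-limiting hyperparameters affect the tree, here is a small sketch that compares a default tree with a constrained one. The synthetic data and the names `X_demo`, `y_demo`, `full_tree`, and `small_tree` are assumptions made up for this illustration, and the chosen limits (`max_depth=3`, `min_samples_leaf=10`) are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (made up): 200 samples, 3 numeric features, noisy binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Default settings: the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Constrained settings: a shallower tree with larger leaves
small_tree = DecisionTreeClassifier(
    random_state=0, max_depth=3, min_samples_leaf=10
).fit(X_demo, y_demo)

print(full_tree.get_depth(), full_tree.get_n_leaves())
print(small_tree.get_depth(), small_tree.get_n_leaves())
```

The unconstrained tree fits the training data perfectly, which usually signals overfitting; limiting depth or leaf size trades some training accuracy for a simpler tree.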

[Feature Importance]

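Expresses the relationship between each feature of the training data and the target as a single score (loosely, a "coefficient"). For a decision tree, scikit-learn computes it as the normalized total impurity reduction contributed by that feature.
- The larger the value, the stronger the relationship.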

import pandas as pd
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Assume DataFrames named X_train / y_train already exist

pipe = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    DecisionTreeClassifier(
        random_state=42,       # arbitrary example value
        criterion='gini',
        min_samples_split=2,
        min_samples_leaf=1,
        max_depth=None         # all other values shown are the defaults
    )
)

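pipe.fit(X_train, y_train)  # fit the model

# extract the decision tree classifier
decision_tree_classifier = pipe.named_steps['decisiontreeclassifier']

# the tree was trained on the one-hot encoded columns, so take the names from the encoder output
encoded_columns = pipe.named_steps['onehotencoder'].transform(X_train).columns

# store the feature importances together with their column names
importances = pd.Series(decision_tree_classifier.feature_importances_, encoded_columns)

# visualize as a horizontal bar plot
importances.sort_values().plot.barh();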
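
As a sanity check on what these scores mean, here is a minimal sketch on made-up data where only one feature carries information. The feature names `signal` and `noise` and the arrays `X_imp` and `y_imp` are hypothetical; the example uses scikit-learn and NumPy only, without the encoder or pandas.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the target copies feature "signal"; "noise" carries no information
feature_names = ["signal", "noise"]
X_imp = np.array([[0, 0], [0, 1], [0, 0], [0, 1],
                  [1, 0], [1, 1], [1, 0], [1, 1]])
y_imp = X_imp[:, 0]

demo_tree = DecisionTreeClassifier(random_state=0).fit(X_imp, y_imp)

# feature_importances_ is normalized, so the values sum to 1
demo_importances = dict(zip(feature_names, demo_tree.feature_importances_))
print(demo_importances)
```

Here a single split on `signal` removes all the impurity, so it should receive the whole importance while `noise` gets none. Note that the scores say how much a feature was used by this particular tree, not whether the relationship is positive or negative.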

Visualization: horizontal bar plot of the sorted feature importances
