📜 Many Hands Make Light Work!

There is a Korean proverb: "Even a sheet of paper is lighter when lifted together." The same idea applies in the world of machine learning, through the technique of ensemble learning. An ensemble combines multiple models (classifiers) into a single model that is smarter and stronger than any one of them alone, literally borrowing the power of collective intelligence.

์˜ค๋Š˜์€ ์ด ์•™์ƒ๋ธ”์˜ ์„ธ๊ณ„๋ฅผ ๊นŠ์ด ํƒํ—˜ํ•ด ๋ณด์•˜๋‹ค. ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ํˆฌํ‘œ ๋ฐฉ์‹๋ถ€ํ„ฐ, ์•ˆ์ •์ ์ธ ์„ฑ๋Šฅ์„ ์ž๋ž‘ํ•˜๋Š” ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ, ๊ทธ๋ฆฌ๊ณ  ์บ๊ธ€๊ณผ ๊ฐ™์€ ๊ฒฝ์ง„๋Œ€ํšŒ์—์„œ ์™•์ขŒ๋ฅผ ์ฐจ์ง€ํ•˜๊ณ  ์žˆ๋Š” ๋ถ€์ŠคํŒ… ๋ชจ๋ธ๋“ค๊นŒ์ง€! ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ์กฐํ™”๋กญ๊ฒŒ ์‚ฌ์šฉํ•ด ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ•˜๋Š” ์—ฌ์ •์„ ์ง„ํ–‰ํ–ˆ๋‹ค.

✨ "Several experts beat a single genius."

์•™์ƒ๋ธ” ํ•™์Šต์€ ๊ฐ๊ธฐ ๋‹ค๋ฅธ ๊ฐ•์ ์„ ๊ฐ€์ง„ ์—ฌ๋Ÿฌ ๋ชจ๋ธ์˜ ์˜๊ฒฌ์„ ์ข…ํ•ฉํ•˜์—ฌ ๋” ์‹ ์ค‘ํ•˜๊ณ  ์ •ํ™•ํ•œ ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๋Š”, ํ˜„์‹ค ์„ธ๊ณ„์˜ ์ „๋ฌธ๊ฐ€ ์œ„์›ํšŒ์™€๋„ ๊ฐ™๋‹ค.


๐Ÿ—ณ๏ธ 1. Voting: ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ์ง‘๋‹จ์ง€์„ฑ

The most intuitive ensemble method is voting. Each model makes its own prediction, and the final decision is reached by majority rule.

| | Hard Voting | Soft Voting |
| --- | --- | --- |
| Method | Simple majority vote | Averages each model's predicted class probabilities |
| Trait | Simple to implement | Generally performs better |
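To see why the two schemes can disagree, here is a toy example (the probability numbers are invented purely for illustration) that applies both rules to three models' outputs for a single sample:

```python
import numpy as np

# Hypothetical predicted probabilities from three models for one sample
# (columns: P(class 0), P(class 1)) -- illustrative numbers, not real output
probs = np.array([
    [0.40, 0.60],   # model A leans toward class 1
    [0.45, 0.55],   # model B leans toward class 1
    [0.90, 0.10],   # model C picks class 0, and very confidently
])

# Hard voting: each model casts one vote for its most likely class
hard_votes = probs.argmax(axis=1)               # [1, 1, 0]
hard_result = np.bincount(hard_votes).argmax()  # majority wins

# Soft voting: average the probabilities first, then take the argmax
soft_result = probs.mean(axis=0).argmax()       # mean = [0.583, 0.417]

print(hard_result, soft_result)  # prints: 1 0
```

Hard voting ignores confidence, so two lukewarm votes for class 1 outvote one very confident vote for class 0; soft voting lets that confidence tip the decision the other way.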

I trained three different models (logistic regression, KNN, decision tree) on the breast cancer dataset, then bundled them with VotingClassifier and checked the performance.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ ๋ฐ ๋ถ„ํ• 
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=121, test_size=0.2)

# Create the individual models and the VotingClassifier
lr_clf = LogisticRegression(max_iter=10000)
knn_clf = KNeighborsClassifier()
dt_clf = DecisionTreeClassifier()

vo_clf_hard = VotingClassifier([('LR', lr_clf), ('KNN', knn_clf), ('DT', dt_clf)], voting="hard")

# ๋ชจ๋ธ ํ•™์Šต ๋ฐ ํ‰๊ฐ€
vo_clf_hard.fit(X_train, y_train)
print(f"Hard Voting Accuracy: {vo_clf_hard.score(X_test, y_test)}")

Result: hard voting reached 97.3% accuracy, a more stable result than any of the three models used on its own. 👍
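Since the table notes that soft voting generally performs better, it is worth rerunning the same three-model setup with voting="soft". This variant is my addition, not part of the original experiment; it assumes all three estimators expose predict_proba, which these three do:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Same data split as in the hard-voting experiment above
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=121, test_size=0.2)

# Same three models, but now averaging their predict_proba outputs
vo_clf_soft = VotingClassifier(
    [('LR', LogisticRegression(max_iter=10000)),
     ('KNN', KNeighborsClassifier()),
     ('DT', DecisionTreeClassifier(random_state=121))],
    voting="soft")
vo_clf_soft.fit(X_train, y_train)
print(f"Soft Voting Accuracy: {vo_clf_soft.score(X_test, y_test)}")
```

The exact number will vary with the split and the tree's randomness, so I have not quoted an expected accuracy here.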


🌳 2. Bagging & Random Forest: Same Model, Different Perspectives

Bagging trains multiple copies of the same kind of model, each on slightly different data. Portions of the data are sampled with replacement (bootstrap) to build several training sets, and each model learns from its own set in parallel. Through this process the weaknesses of individual models are compensated for, and overfitting of the ensemble as a whole is reduced.
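The bare bagging recipe can be sketched directly with scikit-learn's BaggingClassifier before moving on to Random Forest. This example is my addition, reusing the same breast cancer split as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data split as in the voting experiment
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=121, test_size=0.2)

# 50 decision trees, each trained on its own bootstrap sample
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,      # sample the training set with replacement
    n_jobs=-1,           # train the trees in parallel
    random_state=121)
bag_clf.fit(X_train, y_train)
print(f"Bagging Accuracy: {bag_clf.score(X_test, y_test)}")
```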

The flagship of this bagging approach is Random Forest.

from sklearn.ensemble import RandomForestClassifier

# Create and train the RandomForest model
rf_clf = RandomForestClassifier(random_state=121, n_estimators=100)  # use 100 decision trees
rf_clf.fit(X_train, y_train)

print(f"Random Forest Accuracy: {rf_clf.score(X_test, y_test)}")

Result: a full 98.2% accuracy! A genuinely dependable model that delivers very strong performance even without any special tuning. 🎉


✨ 3. Boosting: Learning from a Predecessor's Mistakes

Boosting differs from bagging in that its weak models are trained sequentially. After the first model makes its predictions, the samples it got wrong are given higher weights so the next model focuses on them. In this way each model treats its predecessor's mistakes as lessons and keeps improving.
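This "re-weight the mistakes" recipe is exactly what AdaBoost, the classic boosting algorithm, implements. A quick sketch on the same dataset (my addition, not one of the original experiments):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Same data split as in the earlier experiments
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=121, test_size=0.2)

# Each new weak learner concentrates on the samples
# that the previous ones misclassified
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=121)
ada_clf.fit(X_train, y_train)
print(f"AdaBoost Accuracy: {ada_clf.score(X_test, y_test)}")
```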

a. GBM (Gradient Boosting Machine)

The basic boosting model: it uses gradient descent to progressively correct the errors of the previous models.
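scikit-learn ships this as GradientBoostingClassifier; a minimal sketch on the same dataset (my addition):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Same data split as in the earlier experiments
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=121, test_size=0.2)

# Each tree is fit to the gradient of the loss, i.e. the remaining errors;
# learning_rate shrinks each tree's contribution
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    random_state=121)
gb_clf.fit(X_train, y_train)
print(f"GBM Accuracy: {gb_clf.score(X_test, y_test)}")
```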

b. XGBoost (eXtreme Gradient Boosting)

GBM์˜ ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ•œ ๋ชจ๋ธ๋กœ, ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ(GPU) ์ง€์›, ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€ ๊ทœ์ œ, ์กฐ๊ธฐ ์ค‘๋‹จ ๊ธฐ๋Šฅ ๋“ฑ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ์†๋„์™€ ์„ฑ๋Šฅ์„ ๋ชจ๋‘ ์žก์•˜๋‹ค. ์บ๊ธ€๊ณผ ๊ฐ™์€ ๋ฐ์ดํ„ฐ ๊ฒฝ์ง„๋Œ€ํšŒ์—์„œ โ€œ์ผ๋‹จ XGBoost๋ถ€ํ„ฐ ์“ฐ๊ณ  ๋ณธ๋‹คโ€๋Š” ๋ง์ด ์žˆ์„ ์ •๋„๋ผ๊ณ  ํ•œ๋‹ค.

from xgboost import XGBClassifier

# Create the XGBoost model and configure early stopping
xgbc = XGBClassifier(n_estimators=1000, early_stopping_rounds=10, eval_metric='logloss')
xgbc.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

print(f"XGBoost Accuracy: {xgbc.score(X_test, y_test)}")

Result: 98.2% again! It matched Random Forest's top score, and thanks to early stopping it trained efficiently by cutting off unnecessary rounds.

c. LightGBM

Developed with the goal of being even faster and lighter than XGBoost. Its leaf-wise tree-splitting strategy gives it a particular edge on large datasets.


✨ Today's Retrospective

์˜ค๋Š˜์€ ์•™์ƒ๋ธ” ๊ธฐ๋ฒ•์„ ๊ณต๋ถ€ํ•˜๋ฉฐ ๋ชจ๋ธ ๊ฐ„ ์ง‘๋‹จ์ง€์„ฑ์˜ ํž˜์„ ์ง์ ‘ ํ™•์ธํ–ˆ๋‹ค. Voting, Bagging, Boosting ์ด๋ผ๋Š” ์„ธ ๊ฐ€์ง€ ํฐ ์ถ•์„ ์ค‘์‹ฌ์œผ๋กœ, ๊ฐ ๊ธฐ๋ฒ•์ด ์–ด๋–ป๊ฒŒ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋Œ์–ด์˜ฌ๋ฆฌ๋Š”์ง€ ๋ฐฐ์šธ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ํŠนํžˆ ๋‹จ์ผ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์—๋งŒ ๋งค๋‹ฌ๋ฆฌ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ํ˜„๋ช…ํ•˜๊ฒŒ ์กฐํ•ฉํ•˜๋Š” โ€˜์•„ํ‚คํ…์ฒ˜โ€™ ์„ค๊ณ„๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€ ๊นจ๋‹ฌ์•˜๋‹ค.

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ์˜ ์•ˆ์ •์„ฑ๊ณผ XGBoost์˜ ๊ฐ•๋ ฅํ•จ์ด ํŠนํžˆ ์ธ์ƒ ๊นŠ์—ˆ๋‹ค. ์™œ ๋งŽ์€ ํ˜„์—… ๋ฐ์ดํ„ฐ ์‚ฌ์ด์–ธํ‹ฐ์ŠคํŠธ๋“ค์ด ์ด ๋ชจ๋ธ๋“ค์„ ์‚ฌ๋ž‘ํ•˜๋Š”์ง€ ์•Œ ๊ฒƒ ๊ฐ™์•˜๋‹ค.! ๐Ÿ˜„