๐Ÿ”ฎ ๋ฐ์ดํ„ฐ๋กœ ๋ฏธ๋ž˜๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ์ˆ˜์ • ๊ตฌ์Šฌ

What if we had a crystal ball that could predict the future? Imagine being able to forecast stock prices, next year's revenue, or even real estate prices. In machine learning, regression analysis plays the role of that crystal ball: it learns patterns from past data and predicts continuous values for the future.

์˜ค๋Š˜์€ ํŒŒ์ด์ฌ์„ ์ด์šฉํ•ด ์ด ํšŒ๊ท€ ๋ถ„์„์˜ ์„ธ๊ณ„๋ฅผ ํƒํ—˜ํ•ด ๋ณด์•˜๋‹ค. ๋‹จ์ˆœํ•œ ๊ฐœ๋… ์ •๋ฆฌ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด, ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ณ , ๋‹จ๊ณ„๋ณ„๋กœ ๊ฐœ์„ ํ•ด๋‚˜๊ฐ€๋ฉฐ ์ตœ์ข…์ ์œผ๋กœ AutoML์„ ํ†ตํ•ด ์ตœ์ ์˜ ๋ชจ๋ธ์„ ์ฐพ์•„๊ฐ€๋Š” ์—ฌ์ •์„ ์ƒ์ƒํ•˜๊ฒŒ ๋‹ด์•„๋ณด์•˜๋‹ค.

✨ “Regression is not about drawing a line; it is about listening to what the data says.”

๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ์ˆซ์ž ๋’ค์— ์ˆจ๊ฒจ์ง„ ์˜๋ฏธ๋ฅผ ์ดํ•ดํ•˜๊ณ , ๋” ๋‚˜์€ ์˜ˆ์ธก์„ ์œ„ํ•ด ๋ชจ๋ธ์„ ๋‹ด๊ธˆ์งˆํ•˜๋Š” ๊ณผ์ •์ด์•ผ๋ง๋กœ ํšŒ๊ท€ ๋ถ„์„์˜ ์ง„์ •ํ•œ ๋ฌ˜๋ฏธ๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.


๐Ÿง ํšŒ๊ท€ ๋ถ„์„, ํ•ต์‹ฌ ๊ฐœ๋… ํŒŒ๊ณ ๋“ค๊ธฐ

ํšŒ๊ท€ ๋ถ„์„์„ ์ œ๋Œ€๋กœ ํ™œ์šฉํ•˜๋ ค๋ฉด ๋ช‡ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ฐœ๋…๊ณผ ํ‰๊ฐ€ ์ง€ํ‘œ๋ฅผ ์•Œ์•„์•ผ ํ•œ๋‹ค.

| Category | Simple linear regression | Multiple linear regression |
| --- | --- | --- |
| Concept | Predicts the dependent variable from a single independent variable | Predicts the dependent variable from several independent variables |
| Formula | y = wx + b | y = w₁x₁ + w₂x₂ + ... + b |

๋ชจ๋ธ์„ ๋งŒ๋“ค์—ˆ๋‹ค๋ฉด, ๊ทธ ์„ฑ๋Šฅ์„ ์ œ๋Œ€๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค. ํšŒ๊ท€ ๋ชจ๋ธ์—์„œ๋Š” ์ฃผ๋กœ ์•„๋ž˜ ์ง€ํ‘œ๋“ค์ด ์‚ฌ์šฉ๋œ๋‹ค.

๐Ÿ“ ์ฃผ์š” ํšŒ๊ท€ ํ‰๊ฐ€ ์ง€ํ‘œ

  • R² (coefficient of determination): the share of the variance in the target that the model explains. The closer to 1, the better.
  • MSE (mean squared error): the mean of the squared errors. The smaller, the better.
  • RMSE (root mean squared error): the square root of MSE. It is in the same unit as the target, which makes it easy to interpret.
  • MAE (mean absolute error): the mean of the absolute errors. It is less sensitive to outliers than MSE and RMSE.

์ด ์ง€ํ‘œ๋“ค์„ ์ดํ•ดํ•ด์•ผ ์šฐ๋ฆฌ ๋ชจ๋ธ์ด ์–ผ๋งˆ๋‚˜ ๋˜‘๋˜‘ํ•œ์ง€, ํ˜น์€ ์–ด๋””๊ฐ€ ๋ถ€์กฑํ•œ์ง€ ์•Œ ์ˆ˜ ์žˆ๋‹ค.


🏠 Hands-On: Predicting California Housing Prices

์‹ค์ œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด๋ณด์•˜๋‹ค. 90๋…„๋Œ€ ์บ˜๋ฆฌํฌ๋‹ˆ์•„ ์ฃผํƒ ๊ฐ€๊ฒฉ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด, ์—ฌ๋Ÿฌ ๊ธฐ๋ฒ•์œผ๋กœ ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•ด๋‚˜๊ฐ€๋Š” ๊ณผ์ •์„ ๋‹จ๊ณ„๋ณ„๋กœ ์ง„ํ–‰ํ–ˆ๋‹ค.

Step 1: Baseline Model - Taking the First Step 🚶

I applied the most basic LinearRegression model with no preprocessing at all.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.datasets import fetch_california_housing
import pandas as pd

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ ๋ฐ ๋ถ„ํ• 
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=121)

# ๋ชจ๋ธ ํ•™์Šต ๋ฐ ํ‰๊ฐ€
li_model = LinearRegression()
li_model.fit(X_train, y_train)
y_pred = li_model.predict(X_test)

print(f"R2 Score: {li_model.score(X_test, y_test)}")
print(f"RMSE: {root_mean_squared_error(y_test, y_pred)}")

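Result: the R² score was 0.62 and the RMSE was 0.70. Since the target is expressed in units of $100,000, that RMSE corresponds to a typical error of roughly $70,000. Not bad, but there seemed to be plenty of room for improvement. 🤔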

Step 2: Scaling & Feature Selection - Putting the Model on a Diet 🏃

๋ชจ๋ธ ์„ฑ๋Šฅ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ์Šค์ผ€์ผ๋ง์„ ์ ์šฉํ•˜๊ณ , SequentialFeatureSelector๋กœ ์ค‘์š”ํ•œ ํŠน์„ฑ๋งŒ ๊ณจ๋ผ๋‚ด ๋ณด์•˜๋‹ค. ํ•˜์ง€๋งŒ LinearRegression ๋ชจ๋ธ์—์„œ๋Š” ์Šค์ผ€์ผ๋ง ํšจ๊ณผ๊ฐ€ ๋ฏธ๋ฏธํ–ˆ๊ณ , ํŠน์„ฑ์„ ๋„ˆ๋ฌด ์ ๊ฒŒ ์„ ํƒํ•˜๋‹ˆ ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์ด ๋–จ์–ด์กŒ๋‹ค. ์‹คํŒจ ๐Ÿ˜ญ

ํŠน์„ฑ์„ ๋ฌด์ž‘์ • ์ค„์ด๋Š” ๊ฒƒ์ด ๋Šฅ์‚ฌ๋Š” ์•„๋‹ˆ๋ผ๋Š” ๊ตํ›ˆ์„ ์–ป์—ˆ๋‹ค. ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ๋จผ์ €์˜€๋‹ค.

Step 3: Regularized Models (Ridge, Lasso) - Getting over the Overfitting Speed Bump 🚧

๋ชจ๋ธ์ด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์—๋งŒ ๊ณผํ•˜๊ฒŒ ์ตœ์ ํ™”๋˜๋Š” ๊ณผ์ ํ•ฉ์„ ๋ง‰๊ธฐ ์œ„ํ•ด ๊ทœ์ œ๊ฐ€ ์žˆ๋Š” Ridge, Lasso ๋ชจ๋ธ์„ ์‚ฌ์šฉํ–ˆ๋‹ค. RandomizedSearchCV๋กœ ์ตœ์ ์˜ alpha ๊ฐ’์„ ์ฐพ์•„ ์ ์šฉํ–ˆ๋‹ค.

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

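# Sample alpha log-uniformly between 0.1 and 10 (stats.reciprocal is SciPy's log-uniform distribution)
param_dist = {"alpha": stats.reciprocal(1e-1, 1e1)}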

# Ridge ๋ชจ๋ธ ์ตœ์ ํ™”
ridge = Ridge()
ri_model_search = RandomizedSearchCV(ridge, param_distributions=param_dist, n_iter=70, scoring="neg_root_mean_squared_error", cv=5, random_state=42)
ri_model_search.fit(X_train, y_train)
ri_model = ri_model_search.best_estimator_
y_pred_ridge = ri_model.predict(X_test)

print(f"Ridge RMSE: {root_mean_squared_error(y_test, y_pred_ridge)}")

Result: the Ridge model performed on par with, or slightly better than, the baseline. Regularization seems to have made the model a little more stable.
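Only the Ridge search is shown above. A Lasso search can follow the same pattern; the sketch below is one way to write it, under assumptions: it runs on synthetic make_regression data rather than the housing data, uses fewer search iterations to stay quick, and computes RMSE as the square root of mean_squared_error so it also runs on older scikit-learn versions.

```python
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption): 8 features, 5 of them informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=121
)

# Same log-uniform search range for alpha as in the Ridge search.
param_dist = {"alpha": stats.loguniform(1e-1, 1e1)}
lasso_search = RandomizedSearchCV(
    Lasso(),
    param_distributions=param_dist,
    n_iter=20,
    scoring="neg_root_mean_squared_error",
    cv=5,
    random_state=42,
)
lasso_search.fit(X_tr, y_tr)

la_model = lasso_search.best_estimator_
best_alpha = lasso_search.best_params_["alpha"]
rmse = mean_squared_error(y_te, la_model.predict(X_te)) ** 0.5
# Lasso's L1 penalty can drive some coefficients exactly to zero.
n_zero = int((la_model.coef_ == 0).sum())

print(f"Best alpha: {best_alpha:.3f}")
print(f"Lasso RMSE: {rmse:.3f}")
print(f"Coefficients set to zero: {n_zero} of {la_model.coef_.size}")
```

The practical difference from Ridge is the penalty: Ridge shrinks all coefficients toward zero, while Lasso can remove features entirely, so it doubles as a form of feature selection.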

Step 4: AutoML - Enter the Ultimate Weapon 🤖

์ˆ˜๋™ ํŠœ๋‹์˜ ํ•œ๊ณ„๋ฅผ ๋„˜์–ด, PyCaret๊ณผ Optuna ๊ฐ™์€ AutoML ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด ์ตœ์ ์˜ ๋ชจ๋ธ๊ณผ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํƒ์ƒ‰ํ–ˆ๋‹ค.

# Compare several models with PyCaret
from pycaret.regression import setup, compare_models
s = setup(pd.concat([X, y], axis=1), target='MedHouseVal', session_id=123, preprocess=False, verbose=False)
best_model = compare_models()
print(best_model)

PyCaret์€ ๋‹จ ๋ช‡ ์ค„์˜ ์ฝ”๋“œ๋กœ ์ˆ˜๋งŽ์€ ๋ชจ๋ธ์„ ํ…Œ์ŠคํŠธํ•˜๊ณ  ๊ฐ€์žฅ ์ข‹์€ ๋ชจ๋ธ์„ ์ถ”์ฒœํ•ด์ฃผ์—ˆ๋‹ค. CatBoost๋‚˜ ExtraTreesRegressor๊ฐ€ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ์ด๋Ÿฐ ์ž๋™ํ™” ๋„๊ตฌ ๋•๋ถ„์— ๋ชจ๋ธ ํƒ์ƒ‰ ์‹œ๊ฐ„์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ์—ˆ๋‹ค.


โœจ ์˜ค๋Š˜์˜ ํšŒ๊ณ 

์˜ค๋Š˜์€ ํšŒ๊ท€ ๋ถ„์„์˜ ์ด๋ก ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด, ์‹ค์ œ ๋ฐ์ดํ„ฐ๋กœ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ์ ์ง„์ ์œผ๋กœ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๋Š” ์ „ ๊ณผ์ •์„ ๊ฒฝํ—˜ํ–ˆ๋‹ค. ๋‹จ์ˆœํžˆ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ๋๋‚ด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์™œ ์„ฑ๋Šฅ์ด ์ž˜ ๋‚˜์˜ค์ง€ ์•Š๋Š”์ง€ ๊ณ ๋ฏผํ•˜๊ณ , ์Šค์ผ€์ผ๋ง, ํŠน์„ฑ ์„ ํƒ, ๊ทœ์ œ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•์„ ์‹œ๋„ํ•˜๋Š” ๊ณผ์ •์ด ์ •๋ง ์žฌ๋ฏธ์žˆ์—ˆ๋‹ค.

ํŠนํžˆ AutoML์˜ ๊ฐ•๋ ฅํ•จ์„ ์ฒด๊ฐํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋ฌผ๋ก  ๊ธฐ๋ณธ ์›๋ฆฌ๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜์ง€๋งŒ, ์ด๋Ÿฐ ์ž๋™ํ™” ๋„๊ตฌ๋ฅผ ์ž˜ ํ™œ์šฉํ•˜๋ฉด ํ›จ์”ฌ ๋” ํšจ์œจ์ ์œผ๋กœ ์ข‹์€ ๋ชจ๋ธ์„ ์ฐพ์„ ์ˆ˜ ์žˆ๊ฒ ๋‹ค๋Š” ํ™•์‹ ์ด ๋“ค์—ˆ๋‹ค. ๋‹ค์Œ์—๋Š” ๋ชจ๋ธ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ Optuna๋กœ ๋” ๊นŠ๊ฒŒ ํŠœ๋‹ํ•ด๋ณด๋ฉฐ ์„ฑ๋Šฅ์„ ๊ทนํ•œ๊นŒ์ง€ ๋Œ์–ด์˜ฌ๋ฆฌ๋Š” ์ž‘์—…์„ ํ•ด๋ณผ ๊ณ„ํš์ด๋‹ค. ๐Ÿ˜„