🚗 Today's goal: build a car fuel-efficiency (MPG) prediction model

์˜ค๋Š˜์€ 1970-80๋…„๋Œ€์˜ ํด๋ž˜์‹ํ•œ ์ž๋™์ฐจ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์—ฐ๋น„(MPG)๋ฅผ ์˜ˆ์ธกํ•˜๋Š” DNN ํšŒ๊ท€ ๋ชจ๋ธ์„ PyTorch๋กœ ๋งŒ๋“ค์–ด๋ณด์•˜๋‹ค. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ๋ถ€ํ„ฐ ํŠน์„ฑ ๊ณตํ•™, ๋ชจ๋ธ๋ง, ๊ทธ๋ฆฌ๊ณ  ํ‰๊ฐ€๊นŒ์ง€ ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ฒฝํ—˜ํ•ด๋ณด๋Š” ์•Œ์ฐฌ ์‹ค์Šต์ด์—ˆ๋‹ค. ๐Ÿ˜„

✨ Key takeaways

#๋ฐ์ดํ„ฐ์ „์ฒ˜๋ฆฌ #ํŠน์„ฑ๊ณตํ•™ #์ •๊ทœํ™” #PyTorch #DNN #ํšŒ๊ท€๋ชจ๋ธ #๊ทœ์ œ


🛠️ The process at a glance

Step | What I did | Key tech / concepts
1. Data preparation | Loaded the data from the UCI server and inspected it with pandas. | pandas.read_csv
2. Preprocessing | Cleanly removed the missing values in the Horsepower column. | dropna()
3. Feature engineering | Bucketed Model Year into groups and one-hot encoded Origin. | torch.bucketize, one_hot
4. Normalization | Scaled the numeric features with StandardScaler; skipping it makes training unstable. | StandardScaler
5. Modeling & training | Built a simple DNN regression model in PyTorch and trained it hard. 🤓 | torch.nn.Sequential, MSELoss, SGD
6. Evaluation & regularization | Evaluated the trained model with MSE and MAE, and looked at L1/L2 regularization for preventing overfitting. | L1Loss, weight_decay

📜 Notes on the struggles and lessons

1. ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ฐ ์ „์ฒ˜๋ฆฌ

  • ๋ฐ์ดํ„ฐ ๋กœ๋“œ: pd.read_csv๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์™”๋‹ค. ๊ณต๋ฐฑ์œผ๋กœ ๊ตฌ๋ถ„๋œ ํŒŒ์ผ์ด๋ผ sep=" " ์˜ต์…˜์„ ์คฌ๋‹ค.
  • ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ: Horsepower ์—ด์— ?๊ฐ€ ์„ž์—ฌ์žˆ์—ˆ๋‹ค. na_values='?'๋กœ NaN ๋ณ€ํ™˜ ํ›„ dropna()๋กœ ๋‚ ๋ ค๋ฒ„๋ ธ๋‹ค. ์† ์‹œ์›! ๐Ÿ‘
  • ๋ฐ์ดํ„ฐ ๋ถ„ํ• : train_test_split์œผ๋กœ ํ›ˆ๋ จ์šฉ๊ณผ ํ…Œ์ŠคํŠธ์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆด๋‹ค. ๊ธฐ๋ณธ ์ค‘์˜ ๊ธฐ๋ณธ์ด๋‹ค.
# ๋ฐ์ดํ„ฐ ๋กœ๋“œ ๋ฐ ์ „์ฒ˜๋ฆฌ
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
df = pd.read_csv(url, names=column_names, na_values="?", comment='\t', sep=" ", skipinitialspace=True)
df = df.dropna()

# ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
X_train_df, X_test_df, y_train_df, y_test_df = train_test_split(df.iloc[:, 1:], df['MPG'], test_size=0.2, random_state=121)

2. ๋ฐ์ดํ„ฐ ์ •๊ทœํ™” (Normalization)

Normalization is not optional, it's mandatory! ✨
The feature ranges differ wildly (Weight vs Cylinders), so normalization was essential; otherwise the model can get dragged around by the large-valued features. StandardScaler solved it neatly.

from sklearn.preprocessing import StandardScaler

numeric_columns = ['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration']
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_df[numeric_columns] = scaler.fit_transform(X_train_df[numeric_columns])
# The test data only gets transformed with the already-fitted scaler!
X_test_df[numeric_columns] = scaler.transform(X_test_df[numeric_columns])

3. Feature engineering

  • Model Year: grouped the years into a few buckets (torch.bucketize). I judged that a handful of groups would work better than feeding in every single year as-is.
  • Origin: one-hot encoded the country of manufacture (one_hot), because the model must not mistake the country codes for an ordering!
from torch.nn.functional import one_hot

# Bucketize Model Year into groups
boundaries = torch.tensor([73, 76, 79])
X_train_df['Model Year Bucketed'] = torch.bucketize(torch.tensor(X_train_df['Model Year'].values), boundaries, right=True)
X_test_df['Model Year Bucketed'] = torch.bucketize(torch.tensor(X_test_df['Model Year'].values), boundaries, right=True)

# One-hot encode Origin and concatenate with the other features
# num_classes=3 keeps the train/test encodings the same width even if a split misses a category
train_origin_encoded = one_hot(torch.tensor(X_train_df['Origin'].values) % 3, num_classes=3)
test_origin_encoded = one_hot(torch.tensor(X_test_df['Origin'].values) % 3, num_classes=3)

x_train = torch.cat([torch.tensor(X_train_df[numeric_columns + ['Model Year Bucketed']].values), train_origin_encoded], 1).float()
x_test = torch.cat([torch.tensor(X_test_df[numeric_columns + ['Model Year Bucketed']].values), test_origin_encoded], 1).float()

y_train = torch.FloatTensor(y_train_df.values)
y_test = torch.FloatTensor(y_test_df.values)

4. Building and training the DNN model

  • ๋ชจ๋ธ ์„ค๊ณ„: torch.nn.Sequential๋กœ ๊ฐ„๋‹จํ•œ DNN ๋ชจ๋ธ์„ ๋งŒ๋“ค์—ˆ๋‹ค. ์€๋‹‰์ธต 2๊ฐœ์— ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋Š” ReLU!
  • ์†์‹ค ํ•จ์ˆ˜์™€ ์˜ตํ‹ฐ๋งˆ์ด์ €: ํšŒ๊ท€ ๋ฌธ์ œ๋ผ ์†์‹ค ํ•จ์ˆ˜๋Š” MSELoss, ์˜ตํ‹ฐ๋งˆ์ด์ €๋Š” ํด๋ž˜์‹ํ•œ SGD๋ฅผ ์ผ๋‹ค.
  • ํ•™์Šต ์‹œ์ž‘!: DataLoader๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์กฐ๊ธˆ์”ฉ ๋ชจ๋ธ์— ์ฃผ๋ฉด์„œ 200 ์—ํฌํฌ ๋™์•ˆ ํ•™์Šต์‹œ์ผฐ๋‹ค. ์†์‹ค์ด ์ญ‰์ญ‰ ๋–จ์–ด์ง€๋Š” ๊ฑธ ๋ณด๋‹ˆ ๋ฟŒ๋“ฏํ–ˆ๋‹ค. ๐ŸŽ‰
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn

# ๋ฐ์ดํ„ฐ ๋กœ๋” ์ƒ์„ฑ
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=8, shuffle=True)

# ๋ชจ๋ธ ์ •์˜
hidden_units = [8, 4]
input_size = x_train.shape[1]
all_layers = []
for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)

# Loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# ๋ชจ๋ธ ํ•™์Šต
torch.manual_seed(1)
num_epochs = 200
for epoch in range(num_epochs):
    loss_train = 0
    for X, y in train_dl:
        pred = model(X).squeeze()
        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_train += loss.item()
    if epoch % 20 == 0:
        print(f'Epoch {epoch}, Loss {loss_train/len(train_dl):.4f}')

5. ๋ชจ๋ธ ์„ฑ๋Šฅ ํ‰๊ฐ€ ๋ฐ ๊ทœ์ œ

  • Evaluation: evaluated the trained model on the test data. The MSE and MAE values were not bad at all.
  • Preventing overfitting: regularization is the technique for keeping the model from getting too attached to the training data.
    • L1 (Lasso): drives the weights of unneeded features all the way to zero (a hand-rolled sketch follows the code below).
    • L2 (Ridge): gently presses all the weights down so none of them grow too large. Easy to apply through the SGD optimizer's weight_decay.
# ๋ชจ๋ธ ํ‰๊ฐ€
model.eval()
with torch.no_grad():
    pred = model(x_test).squeeze()
    test_mse = loss_fn(pred, y_test)
    test_mae = nn.L1Loss()(pred, y_test)
    print(f'Test MSE: {test_mse.item():.4f}')  # check the result!
    print(f'Test MAE: {test_mae.item():.4f}')

# Example of applying L2 regularization via weight_decay
optimizer_l2 = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=0.5)
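
The L2 penalty above comes almost for free through weight_decay, but SGD has no equivalent switch for L1. Below is a minimal sketch of how an L1 penalty could be added by hand inside the training loop; the l1_lambda value is an assumption chosen purely for illustration, not something tuned in this exercise.
# Minimal sketch: hand-rolled L1 (Lasso-style) penalty added to the data loss
l1_lambda = 0.001  # assumed penalty strength, for illustration only
optimizer_l1 = torch.optim.SGD(model.parameters(), lr=0.001)

model.train()  # switch back to training mode after the earlier eval()
for X, y in train_dl:
    pred = model(X).squeeze()
    data_loss = loss_fn(pred, y)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())  # sum of absolute weights
    loss = data_loss + l1_lambda * l1_penalty
    loss.backward()
    optimizer_l1.step()
    optimizer_l1.zero_grad()

Unlike weight_decay, this penalty shows up directly in the reported loss, which makes it easier to see how hard it is pushing the weights toward zero.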

✨ Today's retrospective

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ๋ถ€ํ„ฐ ๋ชจ๋ธ๋ง, ํ‰๊ฐ€๊นŒ์ง€ ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ฒฝํ—˜ํ•ด๋ณธ ์ข‹์€ ์‹ค์Šต์ด์—ˆ๋‹ค. ํŠนํžˆ ๋ฐ์ดํ„ฐ ์ •๊ทœํ™”์˜ ์ค‘์š”์„ฑ์„ ๋‹ค์‹œ ํ•œ๋ฒˆ ๋А๊ผˆ๋‹ค. PyTorch๋กœ ์ง์ ‘ ๋ชจ๋ธ์„ ์งœ๋ณด๋‹ˆ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์˜ ๊ตฌ์กฐ๊ฐ€ ๋” ๋ช…ํ™•ํ•˜๊ฒŒ ์ดํ•ด๋๊ณ , ํ›ˆ๋ จ ์†์‹ค๊ณผ ํ…Œ์ŠคํŠธ ์†์‹ค์„ ๋น„๊ตํ•˜๋ฉฐ ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๊ฐ€๋Š ํ•ด๋ณผ ์ˆ˜ ์žˆ์—ˆ๊ณ , ๊ทœ์ œ์˜ ํ•„์š”์„ฑ๋„ ์ฒด๊ฐํ–ˆ๋‹ค.

๋‹ค์Œ์—๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์œผ๋กœ ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ๋” ๋Œ์–ด์˜ฌ๋ ค ๋ด์•ผ๊ฒ ๋‹ค.