😵‍💫 The Curse of Dimensionality: Can Too Many Features Be a Problem?

The world of data is often like a giant library crammed with too much information. Try to read every book (feature) and you are bound to get lost. Model training is no different. Once there are too many features (dimensions), performance actually gets worse, a problem known as the 'curse of dimensionality'.

💡 "As the number of dimensions grows, the volume of the data space grows exponentially and the data becomes sparse."

The amount of data available for training is limited, while the space that data lives in keeps getting larger, so the model has little chance of learning properly. The extra computation comes on top of that.
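To get a feel for this sparsity, here is a minimal sketch that is not part of the original exercises. It assumes uniformly random points in a unit hypercube, and the sample size, seed, and the helper name `distance_contrast` are arbitrary choices of mine. It measures how much farther the farthest point is than the nearest one from a random query point. As the dimension grows, that gap shrinks relative to the nearest distance, so "near" and "far" neighbors become hard to tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_points = 500                   # illustrative sample size (an assumption)

def distance_contrast(dim):
    """Return (max - min) / min over distances from one query point to all points."""
    points = rng.random((n_points, dim))   # uniform samples in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {dim: distance_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrasts.items():
    print(f"dim={dim:>4}  relative contrast={contrast:.2f}")
```

With the same 500 points, the contrast should fall steadily as the dimension goes up, which is one concrete way to see why a fixed-size dataset covers a high-dimensional space so thinly.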

The spell that breaks this curse is dimensionality reduction. Today I learned two such spells: principal component analysis (PCA), the most powerful of the bunch, and t-SNE, the master of visualization.


🧙‍♂️ Principal Component Analysis (PCA): In Search of the Essence of the Data

PCA looks at how the features scattered across a dataset correlate with one another and finds the new axes that explain them best, the principal components. It does not simply throw a few features away. It combines the existing features into new ones that preserve as much of the data's variance as possible.

From a linear algebra point of view, PCA amounts to an eigendecomposition of the data's covariance matrix. The resulting eigenvectors are the principal components, and the one with the largest eigenvalue points in the direction of greatest variance. It sounds hard, but remembering the idea of 'finding the new axes that best explain the data' is enough!
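To convince myself of the eigendecomposition view, here is a small sketch that is my own addition, not part of the original exercises. It computes the covariance matrix of the standardized iris data with NumPy, decomposes it with `np.linalg.eigh`, and compares the result against scikit-learn's `PCA`. The variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the features (rowvar=False: each column is a variable)
cov = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # re-sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA(n_components=2).fit(X)

# Share of total variance carried by the top two eigenvectors
manual_ratio = eigvals[:2] / eigvals.sum()
print(manual_ratio)
print(pca.explained_variance_ratio_)

# Eigenvectors are only defined up to sign, so compare absolute values
same_axes = np.allclose(np.abs(eigvecs[:, :2].T), np.abs(pca.components_), atol=1e-6)
print(same_axes)
```

The two printed ratio vectors should agree, and the top two eigenvectors should match the rows of `pca.components_` up to a sign flip.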


🚀 Hands-On PCA with the Iris Dataset

Seeing is believing! I tested what PCA can do on scikit-learn's iris dataset, reducing its 4 features to 2 and comparing performance against the original.

1. Data preparation and scaling

First I loaded the data and standardized it. PCA is driven by variance, so a feature measured on a larger scale would dominate the components. That makes scaling an essential preprocessing step.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the data and standardize it (zero mean, unit variance per feature)
iris = load_iris()
scaler = StandardScaler()
iris_scaled = scaler.fit_transform(iris.data)

2. Applying PCA and comparing performance

I reduced the 4 features to just 2 principal components.

# Project the scaled data onto 2 principal components
pca = PCA(n_components=2)
iris_pca = pca.fit_transform(iris_scaled)

# Compare performance with a random forest classifier
rcf = RandomForestClassifier(random_state=156)

# Cross-validate on the original data (tree models do not need scaled inputs)
scores = cross_val_score(rcf, iris.data, iris.target, scoring='accuracy', cv=3)
print(f'Original data mean accuracy: {np.mean(scores):.2f}')

# Cross-validate on the PCA-transformed data
scores_pca = cross_val_score(rcf, iris_pca, iris.target, scoring='accuracy', cv=3)
print(f'PCA-transformed data mean accuracy: {np.mean(scores_pca):.2f}')

Result: mean accuracy was 0.96 on the original data and 0.89 on the PCA-transformed data.
Accuracy dropped by about 7 percentage points, but the model still classified most samples correctly with only half as many features. This suggests PCA reduced the dimensionality while keeping much of the information that matters. Success! 🎉
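To put a number on how much information the two components retain, `explained_variance_ratio_` can be inspected after fitting. The sketch below is my own addition. It also moves the scaler and PCA inside a pipeline so that they are refitted on each training fold, which is one way to keep the held-out fold from influencing the transform. The `random_state` and `cv` values mirror the comparison above; everything else is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Share of the total variance kept by each of the two components
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)
ratio = pca.explained_variance_ratio_
print(ratio, ratio.sum())

# Scaler and PCA inside the pipeline are refitted on every training fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(random_state=156),
)
scores = cross_val_score(pipe, iris.data, iris.target, scoring="accuracy", cv=3)
print(f"Pipeline CV accuracy: {np.mean(scores):.2f}")
```

If the summed ratio is close to 1, the two components capture most of the variance. That alone does not guarantee class separability, since variance and label information are not the same thing.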


🎨 Manifold Learning and Visualization with t-SNE

Where PCA is a linear method that preserves the overall structure of the data, t-SNE is a nonlinear technique that preserves the relationships between neighboring points, which makes it excellent for visualizing high-dimensional data in 2 or 3 dimensions.

I visualized scikit-learn's digits dataset (64 dimensions) with t-SNE.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects

# Load the data and embed it in 2D with t-SNE
digits = load_digits()
tsne = TSNE(n_components=2, init='pca', random_state=123)
X_digits_tsne = tsne.fit_transform(digits.data)

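# Visualize the t-SNE result
def plot_projection(x, colors):
    """Scatter the 2D embedding per digit and label each cluster at its median."""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.set_aspect('equal')
    for i in range(10):
        ax.scatter(x[colors == i, 0], x[colors == i, 1])
    for i in range(10):
        # Place the digit label at the median position of its points
        xtext, ytext = np.median(x[colors == i, :], axis=0)
        txt = ax.text(xtext, ytext, str(i), fontsize=24)
        # A white outline keeps the label readable on top of the markers
        txt.set_path_effects([PathEffects.Stroke(linewidth=5, foreground="w"), PathEffects.Normal()])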

plot_projection(X_digits_tsne, digits.target)
plt.show()

Looking at the result, it was fascinating to see samples of the same digit huddle together into clusters. The hidden structure of the data was visible at a glance.
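The plot is convincing by eye, but neighbor preservation can also be measured. As a rough sketch of my own, scikit-learn's `trustworthiness` function scores how well an embedding avoids introducing neighbors that were not neighbors in the original space. Using only the first 500 samples is purely my assumption to keep the run short, and `n_neighbors=5` is an arbitrary choice.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

digits = load_digits()
X = digits.data[:500]   # subset only to keep the run short (an assumption)

# Two 2D embeddings of the same 64-dimensional data
X_pca = PCA(n_components=2, random_state=123).fit_transform(X)
X_tsne = TSNE(n_components=2, init="pca", random_state=123).fit_transform(X)

# Scores lie in [0, 1]; higher means fewer false neighbors in the embedding
t_pca = trustworthiness(X, X_pca, n_neighbors=5)
t_tsne = trustworthiness(X, X_tsne, n_neighbors=5)
print(f"PCA: {t_pca:.3f}, t-SNE: {t_tsne:.3f}")
```

On this kind of data I would expect t-SNE to score higher than a 2-component PCA, which matches the idea that it focuses on local neighborhoods. Distances between far-apart clusters in a t-SNE plot should still be read with caution.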


🛠️ PCA and Modeling in One Go with a Pipeline!

For the Wisconsin breast cancer dataset (30 features), I used make_pipeline to handle preprocessing, dimensionality reduction, and model training in a single step.

from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the data and split it into train and test sets
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

# Build the pipeline (scale -> PCA -> logistic regression) and fit it
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_train, y_train)

print(f"Train Score: {pipe.score(X_train, y_train):.4f}")
print(f"Test Score: {pipe.score(X_test, y_test):.4f}")

Result: even with 30 features reduced to just 2, test accuracy reached an impressive 97.37%. Thanks to the pipeline, the code also became much more concise and organized.
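Fixing `n_components=2` was a choice made for this exercise. As a hedged alternative sketch, `PCA` also accepts a float between 0 and 1, in which case it keeps as many components as needed to reach that share of explained variance. The 0.95 threshold below is my own illustrative value, and the fitted step is read back through `named_steps`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

# A float in (0, 1) asks PCA to keep enough components to reach that variance share
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression())
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
fitted_pca = pipe.named_steps["pca"]
n_kept = fitted_pca.n_components_
kept_variance = fitted_pca.explained_variance_ratio_.sum()
test_score = pipe.score(X_test, y_test)
print(f"components kept: {n_kept}, variance kept: {kept_variance:.3f}")
print(f"Test Score: {test_score:.4f}")
```

Because the scaler and PCA are fitted only on `X_train` inside the pipeline, the test set plays no part in choosing the components.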


✨ Today's Retrospective

Today I took a deep look at PCA and t-SNE, two representative dimensionality reduction techniques from unsupervised learning. Beyond simply cutting the number of features, I saw firsthand how PCA finds the 'principal components' that preserve as much of the data's variance as possible, and how powerful t-SNE is at visualizing data while preserving its local structure.

Working through the iris, digits, and breast cancer datasets, I saw that model performance does not fall much even when the dimensionality is cut sharply, which brought home how effective dimensionality reduction can be. Using a pipeline in particular let me manage a multi-step machine learning workflow efficiently.

From now on, when I face a complex dataset, I want to make a habit of using PCA and t-SNE to understand the essence of the data and analyze it efficiently before blindly feeding it into a model. The curse of dimensionality does not scare me anymore! 😄