🤖 K-Means: The Algorithm That Finds Hidden Groups in Data

Today I dug deep into K-Means, the best-known clustering algorithm in unsupervised learning. Like a librarian who walks through a huge, unsorted library, finds books on similar topics, and arranges them on the same shelves, K-Means finds the hidden structure in data and bundles the points into groups.

✨ "Let's find each data point's friends!"

With K-Means you can discover meaningful patterns and groups even in data that has no ground-truth labels.


🛠️ K-Means vs. K-Nearest Neighbors (K-NN)

The names are similar enough to be confusing, but the two algorithms have completely different goals.

| Category | K-Means | K-Nearest Neighbors (K-NN) |
|---|---|---|
| Learning type | Unsupervised learning (clustering) | Supervised learning (classification) |
| Goal | Partition unlabeled data into K clusters | Decide the class of a new data point from its K nearest neighbors |

In common: both take a user-chosen K (number of clusters for K-Means, number of neighbors for K-NN) and both work on distances between points.
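To make the contrast concrete, here is a small sketch of my own (not from any particular tutorial) that fits both models on Iris. The thing to notice is that `KMeans.fit` only receives the features, while `KNeighborsClassifier.fit` also needs the labels. The choices `n_clusters=3`, `n_neighbors=3`, and `random_state=0` are arbitrary assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# K-Means: unsupervised, fit() sees only the features; K = number of clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.predict(X[:5])  # arbitrary cluster indices, not class labels

# K-NN: supervised, fit() needs the labels; K = number of neighbors that vote
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
class_ids = knn.predict(X[:5])   # classes predicted from the labeled neighbors

print("K-Means cluster ids:", cluster_ids)
print("K-NN class ids:     ", class_ids)
```

The cluster ids from K-Means are just index numbers, so they do not have to line up with the real class ids even when the grouping itself is good.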

⚙️ How Does K-Means Work?

K-Means finds its clusters by repeatedly moving a set of special points called centroids.

  1. Initialization: place K centroids at arbitrary positions.
  2. Cluster assignment: assign each data point to its nearest centroid.
  3. Centroid update: move each centroid to the center (mean) of the points assigned to it.
  4. Repeat: keep repeating steps 2 and 3 until the centroids stop moving, and the final clusters are born! 🐜 (The result is a local optimum, so it can depend on where the centroids started.)
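The four steps above can be sketched in a few lines of NumPy. This is my own minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the choice to start from K randomly picked data points, and the toy two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: use k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # final assignment so that labels match the returned centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# toy data: two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

On data this cleanly separated the two centroids should end up near the two blob centers; on harder data the outcome can depend on the seed.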

🐍 Implementing K-Means with Scikit-learn

I practiced K-Means clustering hands-on with the Iris dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# 1. Load the data
iris = load_iris()
irisDF = pd.DataFrame(iris.data, columns=iris.feature_names)

# 2. Fit the K-Means model (K=3)
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=121)
kmeans.fit(irisDF)

# 3. Check the cluster assignments
print(kmeans.labels_)
# [1 1 1 ... 0 0 0]

The inertia_ attribute is the sum of squared distances from each sample to its assigned centroid, so it shows how tightly the data inside each cluster are packed. It keeps shrinking as K grows, though, which makes it hard to use as an absolute criterion for picking the best K. 🤔
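To check my understanding of what inertia_ measures, here is a small sketch that recomputes it by hand from `cluster_centers_` and `labels_`. The variable names `assigned_centroids` and `manual_inertia` are my own; the model settings mirror the ones used in this post.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=121).fit(X)

# centroid of the cluster each sample was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# sum of squared distances from every sample to its own centroid
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two printed numbers should agree up to floating-point error.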

Finding the Best K: The Elbow Method

So how do we find the best K? I used the elbow method. Plot how inertia changes as K increases, and the point where the curve bends sharply, like an elbow, is a good candidate for K.

# Look for a good K with the elbow method
inertias = []
K_range = range(1, 12)

for k in K_range:
    # n_init is set explicitly because its default differs across scikit-learn versions
    km = KMeans(n_clusters=k, n_init=10, random_state=121)
    km.fit(iris.data)
    inertias.append(km.inertia_)

# Plot inertia against K
plt.plot(K_range, inertias, '-o')
plt.xlabel('Number of clusters, K')
plt.ylabel('Inertia')
plt.show()

Looking at the plot, I confirmed that the slope flattens out around K=3.
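Reading an elbow off a plot is a bit subjective, so as a rough complement I also tried printing how much inertia drops each time K goes up by one. This is only a sketch of my own: the range 1 to 7 and the names `ks` and `drops` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
ks = list(range(1, 8))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=121).fit(X).inertia_
    for k in ks
]

# how much inertia falls when going from K to K + 1
drops = -np.diff(inertias)
for k, d in zip(ks[:-1], drops):
    print(f"K={k} -> K={k + 1}: inertia drops by {d:.1f}")
```

Large drops followed by much smaller ones correspond to the bend in the curve.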


👍 Pros and Cons of K-Means 👎

| Pros | Cons |
|---|---|
| Easy to understand and implement; intuitive. | K has to be set by hand. 😭 |
| Convergence is guaranteed (though only to a local optimum). | Distance-based, so complexity grows as the number of dimensions increases. |
| Applicable to large datasets. | Sensitive to outliers and to feature scale. |
| - | Results can change depending on the initial centroid positions. |
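The last drawback, sensitivity to the starting centroids, is easy to poke at. In the sketch below (my own experiment, with the 20 seeds chosen arbitrarily) each run uses a single purely random initialization, and the spread of final inertia values is compared with a run that uses `k-means++` seeding and 10 restarts.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# one run per seed, each from a single purely random initialization
single_runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(20)
]
print("single random init: best %.2f, worst %.2f" % (min(single_runs), max(single_runs)))

# k-means++ seeding with 10 restarts; the run with the lowest inertia is kept
restarted = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print("k-means++ with n_init=10: %.2f" % restarted.inertia_)
```

If the best and worst single-run values differ, some seeds got stuck in a poorer local optimum; restarts are the usual way to reduce that risk.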

👏 Techniques for Better Performance

I also tried a few methods that make up for K-Means' weaknesses and improve its performance.

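  • PCA (dimensionality reduction): reduced the computational cost of working with high-dimensional data.
  • Scaling: putting the features on a common scale with StandardScaler kept features with large ranges from dominating the distance computation. (StandardScaler is itself affected by outliers, so this addresses scale sensitivity rather than outliers as such.)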
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)

# Reduce to 2 dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Refit K-Means on the reduced data
kmeans_pca = KMeans(n_clusters=3, n_init=10, random_state=121)
kmeans_pca.fit(X_pca)
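Before trusting clusters found in the 2-D projection, it seems worth checking how much of the original variance the two components keep. A minimal sketch using PCA's `explained_variance_ratio_` attribute (the variable name `total_kept` is mine):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize: every feature ends up with mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2).fit(X_scaled)

# fraction of the total variance captured by each of the 2 components
print(pca.explained_variance_ratio_)
total_kept = pca.explained_variance_ratio_.sum()
print("total variance kept:", total_kept)
```

If the total were low, clustering on the projection could hide structure that exists in the full feature space.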

📊 How Good Is the Clustering? (Evaluation)

1. Silhouette Coefficient

  • A metric for when there are no ground-truth labels.
  • Measures how close the data inside a cluster are to each other and how far they are from other clusters.
  • Values range from -1 to 1; I treated scores closer to 1 as better clustering.
from sklearn.metrics import silhouette_score

score = silhouette_score(iris.data, kmeans.labels_)
print('Silhouette score: {0:.3f}'.format(score))
# Silhouette score: 0.553
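Since the silhouette score needs no labels, it can also be compared across several values of K as a second opinion next to the elbow plot. A rough sketch, assuming the same Iris data and an arbitrary range of K from 2 to 6 (the `scores` dictionary is my own naming):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=121).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette = {scores[k]:.3f}")
```

The K with the highest silhouette score does not have to match the elbow, so I would read the two together rather than trust either one alone.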

2. Homogeneity, Completeness, V-measure

  • Used when ground-truth labels are available.
  • Homogeneity: the degree to which each cluster contains only members of a single true class.
  • Completeness: the degree to which all members of a true class are assigned to the same cluster.
  • V-measure: the harmonic mean of the two values above.
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

print("Homogeneity:", homogeneity_score(iris.target, kmeans.labels_))
print("Completeness:", completeness_score(iris.target, kmeans.labels_))
print("V-measure:", v_measure_score(iris.target, kmeans.labels_))
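To convince myself that V-measure really is the harmonic mean of the other two scores, here is a quick check that computes 2hc / (h + c) by hand and compares it with `v_measure_score` at its default settings. The names `h`, `c`, `v`, and `harmonic_mean` are my own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=121).fit_predict(iris.data)

h = homogeneity_score(iris.target, labels)
c = completeness_score(iris.target, labels)
v = v_measure_score(iris.target, labels)

# harmonic mean of homogeneity and completeness
harmonic_mean = 2 * h * c / (h + c)
print(v, harmonic_mean)
```

Both printed values should be the same up to floating-point error.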

🧐 An Alternative to K-Means: Density-Based Clustering (DBSCAN)

K-Means does well on round, blob-like clusters, but it has trouble finding clusters with complex shapes. In that situation DBSCAN is an option.

  • It clusters by finding regions where the data are densely packed.
  • Unlike K-Means, the number of clusters does not have to be fixed in advance, and noise (outliers) is labeled automatically, which was convenient. The eps and min_samples parameters still have to be tuned, though.
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.6, min_samples=8)
dbscan_labels = dbscan.fit_predict(iris.data)

# -1 marks samples classified as noise (outliers)
print(np.unique(dbscan_labels, return_counts=True))
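Iris does not really show off the "complex shapes" point, so here is a sketch on scikit-learn's two-moons toy data instead. The dataset choice and the settings `eps=0.2` and `min_samples=5` are my own assumptions for illustration, and V-measure is used only because the true moon membership is known for this synthetic data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import v_measure_score

# two interleaving half-circles: clusters that are not blob-shaped
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# compare each result against the true moon each point came from
km_v = v_measure_score(y, km_labels)
db_v = v_measure_score(y, db_labels)
print(f"K-Means V-measure: {km_v:.3f}")
print(f"DBSCAN  V-measure: {db_v:.3f}")
```

K-Means tends to cut the moons with a straight boundary, while DBSCAN can follow each dense arc, so I would expect DBSCAN to score higher here. The result is sensitive to eps, so other values may behave differently.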

✨ Today's Retrospective

K-Means turned out to be a simple yet powerful clustering technique. The process of finding a good K with the elbow method was especially interesting. At the same time, I now clearly understand its drawbacks: K has to be chosen by hand, and it is sensitive to outliers.

Next time I want to dig deeper into density-based clustering like the DBSCAN I learned today, along with other clustering algorithms, and build the ability to pick the method that best fits the characteristics of the data. 😄