Let's Do Cross-Validation in Python

In this article, I'll walk through several ways to do cross-validation and look at why it matters in data science and machine learning work today.

We know that ML needs a large amount of historical data as input for training. The real question is whether a model trained on that data will still work well on the data being generated in real time today.

We also have to trust that the input data is representative and can be used to predict the future. For example, if we use data going back 20 years, we should ask whether data from 20 years ago can still predict the future well. Answer that question first.

Once we believe the data we have is usable for modeling, cross-validation comes in to test whether the model we built really works, by splitting the data into parts and testing repeatedly.

We use one part for learning (training) and test accuracy on the rest (testing). This helps us catch cases where the model performs too well on historical data but poorly on new data (overfitting), or where the model performs poorly from the start and cannot find any relationship between input and output (underfitting).

This time I'll cover the basics of cross-validation, its main types, and how to use it so that our models perform better, under these topics:

Before We Start
What Is Cross-Validation
K-Fold Cross-Validation
Leave-One-Out Cross-Validation
Bootstrap Cross-Validation

Before We Start

Before we move on: this article skips over some topics, so please make sure you're already somewhat familiar with the following:

Machine learning basics, such as the concepts of overfitting and underfitting, and model evaluation metrics such as MAE, RMSE, and accuracy
How to use libraries such as Scikit-learn, Pandas, or NumPy
Preparing data for a model, such as splitting data for training/testing
And if you want to follow along, make sure these libraries are installed on your machine first: numpy, pandas, scikit-learn, matplotlib

What Is Cross-Validation

Cross-validation is a resampling method used to estimate how well a predictive model generalizes (to the whole population) and to guard against overfitting. (I used to remember CV only by its literal Thai translation, "crosswise checking".)

Unlike a plain data split, cross-validation is more flexible and gives a more complete picture of the data, because it rotates the roles of the training set and the testing set.

That way each data point gets to serve in both training and testing. This also gives a sounder basis for improving the model. The key things cross-validation does are:

Evaluate model performance
Reduce the bias of relying on one particular split, because the model is tested on several different subsets of the data
Tune hyperparameters based on the results of the cross-validation rounds
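
As a minimal sketch of the hyperparameter-tuning point above, scikit-learn's GridSearchCV runs cross-validation for each candidate setting and keeps the one with the best average score. This example does not use the article's data: X_demo and y_demo are synthetic stand-ins from make_classification, and the candidate values for C are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption: not the dataset used in this article)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Candidate regularization strengths to compare (arbitrary illustrative values)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored with the same 5-fold split
cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    LogisticRegression(max_iter=200), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["C"])
print("Best mean CV accuracy:", round(search.best_score_, 4))
```

The score reported for the winning setting is itself chosen from several tries, so it tends to be a little optimistic; a separate test set is still useful for the final check.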

Cross-validation is closely tied to the "bias-variance tradeoff", which says that every machine learning model carries error from bias (overly simple assumptions in the model itself) and from variance (sensitivity to the particular data it was trained on). Averaging over several folds reduces the variance of the performance estimate compared with a single split.

Cross-validation comes in many flavors, each with its own structure and technique. Let's look at how the three main ones work.

I'll reuse the data from the DM Logistic Regression example in an earlier article to do the CV here, so you may want to read that one first. That article used holdout validation: the data is split into a single training set and a single test set. It is the simplest approach, but the result depends heavily on that one split, so there is a high risk of not noticing that the model has overfit the training set.
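
To see why a single holdout split can be fragile, here is a minimal sketch that repeats the holdout split with different random seeds and compares the test accuracy each time. It does not use the dm data: X_demo and y_demo are synthetic stand-ins from make_classification, so the numbers it prints are not results from this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the dataset used in this article)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

holdout_scores = []
for seed in range(5):
    # Same data, same model, only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model_demo = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
    holdout_scores.append(model_demo.score(X_te, y_te))

print("Holdout accuracy per seed:", [round(s, 4) for s in holdout_scores])
print("Spread (max - min):", round(max(holdout_scores) - min(holdout_scores), 4))
```

If the accuracy moves noticeably from seed to seed, a single split is not telling the whole story, which is the gap cross-validation is meant to fill.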

1. K-Fold Cross-Validation

How it works

Split the dataset into k parts (folds). Common choices are k=5 or k=10.
The model trains on k-1 folds and is tested on the one remaining fold.
This train-and-test step is repeated with a different fold held out each time, until every one of the k folds has been used as the test set once.
Finally, performance is estimated as the average of the results from all rounds.
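
The steps above can be made concrete on a tiny example. The sketch below uses a made-up array of 10 items (not the article's data) and no shuffling, so the folds are easy to read: each round holds out 2 items for testing and trains on the other 8.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)          # 10 made-up samples, used only to show the indices
kf_demo = KFold(n_splits=5)  # no shuffling, so folds are contiguous blocks

folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in kf_demo.split(toy)
]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx} test={test_idx}")

# Fold 1: train=[2, 3, 4, 5, 6, 7, 8, 9] test=[0, 1]
# ...
# Fold 5: train=[0, 1, 2, 3, 4, 5, 6, 7] test=[8, 9]
```

Every index appears in exactly one test fold, which is what lets each data point take part in both training and testing.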

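Why it's good

Works well with (almost) every dataset
Reduces the variance of the performance estimate by averaging the results across folds

Things to consider

The important thing is choosing a suitable k, which is usually 5 or 10. A larger k gives each round more training data but requires more model fits.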

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

# dm is the diabetes DataFrame from the earlier DM Logistic Regression article
# Define the features (X) and the target (y)
X = dm[['pregnancies', 'glucose', 'bloodpressure', 'bmi', 'diabetespedigreefunction']]
y = dm['outcome']

# Set up cross-validation:
#   n_splits     - number of folds to split into
#   shuffle      - shuffle the data before splitting into folds
#   random_state - makes the split identical on every run
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)

accuracies = []
errors = []

for train_index, test_index in kf.split(X):
    # .iloc selects rows by integer position
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    fold_accuracy = accuracy_score(y_test, predictions)  # accuracy for this fold
    
    accuracies.append(fold_accuracy)

    fold_error = 1 - fold_accuracy  # error for this fold (Error = 1 - Accuracy)
    errors.append(fold_error)

print("Accuracies for each fold:", np.round(accuracies, 4))
print("Average Accuracy:", np.round(np.mean(accuracies), 4))
print("Standard Deviation of Accuracy:", np.round(np.std(accuracies), 4))

print("Errors for each fold:", np.round(errors, 4))
print("Average Error:", np.round(np.mean(errors), 4))
print("Standard Deviation of Error:", np.round(np.std(errors), 4))

Accuracies for each fold: [0.7727 0.7792 0.7468 0.817  0.7451]
Average Accuracy: 0.7722
Standard Deviation of Accuracy: 0.0262

Errors for each fold: [0.2273 0.2208 0.2532 0.183  0.2549]
Average Error: 0.2278
Standard Deviation of Error: 0.0262

Choosing between Accuracy and Error
Accuracy: the proportion of samples the model predicts correctly. Higher is better (closer to 1, or 100%).
Error: the proportion of samples the model predicts incorrectly. Lower is better (closer to 0, or 0%).
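
The manual K-Fold loop above can usually be shortened with scikit-learn's cross_val_score, which does the split, fit, and scoring in one call. This is a sketch on synthetic stand-in data (X_demo and y_demo are assumptions, not the dm data), so its numbers will not match the output above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the dataset used in this article)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold, same idea as the accuracies list in the loop
scores = cross_val_score(
    LogisticRegression(max_iter=200), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print("Accuracies for each fold:", scores.round(4))
print("Average Accuracy:", round(scores.mean(), 4))
print("Average Error:", round(1 - scores.mean(), 4))
```

Passing the KFold object explicitly keeps the split the same as in the manual version; the explicit loop is still handy when you need extra per-fold values.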

2. Leave-One-Out Cross-Validation

How it works

Leave-One-Out Cross-Validation, or LOOCV, uses each data point (one record) in turn as the test set and all of the remaining data as the training set. The model is therefore fit N times, which makes this method very resource-intensive and unsuitable for large datasets.

Why it's good

LOOCV is best suited to small datasets (for example, N below a hundred or a few hundred), because its high computational cost only stays manageable at that size.
LOOCV gives the lowest-bias evaluation of the model, because each run is trained on almost all of the data.

Things to consider

For large datasets (for example, N in the thousands or tens of thousands), LOOCV takes so long to compute that it is often impractical.
Although its bias is low, the variance of the result may be higher than with K-Fold, because each test set is tiny (just 1 point) and the N training sets are almost identical to one another.

from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

X = dm[['pregnancies', 'glucose', 'bloodpressure', 'bmi', 'diabetespedigreefunction']]
y = dm['outcome']

# LeaveOneOut needs no n_splits; the number of splits is automatically the number of samples
loo = LeaveOneOut()
model = LogisticRegression(max_iter=200)
accuracies = []
errors = []

for train_index, test_index in loo.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    fold_accuracy = accuracy_score(y_test, predictions)
    accuracies.append(fold_accuracy)

    fold_error = 1 - fold_accuracy  # error for this fold (Error = 1 - Accuracy)
    errors.append(fold_error)

print(f"Number of folds (samples): {len(accuracies)}")  # equals the total number of samples
print("Average Accuracy (LOOCV):", np.round(np.mean(accuracies), 4))
print("Standard Deviation of Accuracy (LOOCV):", np.round(np.std(accuracies), 4))
print("Average Error (LOOCV):", np.round(np.mean(errors), 4))
print("Standard Deviation of Error (LOOCV):", np.round(np.std(errors), 4))

Number of folds (samples): 768
Average Accuracy (LOOCV): 0.7682
Standard Deviation of Accuracy (LOOCV): 0.422

Average Error (LOOCV): 0.2318
Standard Deviation of Error (LOOCV): 0.422
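
The standard deviation of 0.422 looks large next to K-Fold's 0.0262, but the two are not comparable. In LOOCV every test set has one sample, so each "fold accuracy" is either 1 or 0, and the standard deviation of such values depends only on the mean: sqrt(p * (1 - p)). The sketch below rebuilds the scores from the reported figures, assuming 590 of the 768 predictions were correct (the only count that rounds to 0.7682).

```python
import numpy as np

n_samples = 768
n_correct = 590  # assumption: the only count consistent with the reported 0.7682

# Each LOOCV score is 1 (correct) or 0 (wrong)
loo_scores = np.array([1] * n_correct + [0] * (n_samples - n_correct))

p = loo_scores.mean()
std_direct = loo_scores.std()          # what np.std reports in the code above
std_formula = np.sqrt(p * (1 - p))     # closed form for 0/1 values

print("Mean accuracy:", round(float(p), 4))             # 0.7682
print("Std of 0/1 scores:", round(float(std_direct), 3))  # 0.422
print("sqrt(p * (1 - p)):", round(float(std_formula), 3))  # 0.422

# Spread of the averaged estimate itself, treating scores as independent (a rough guide only)
print("Std / sqrt(N):", round(float(std_direct / np.sqrt(n_samples)), 4))
```

So the 0.422 describes how individual 0/1 predictions scatter, not how unstable the averaged LOOCV accuracy is.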

3. Bootstrap Cross-Validation

How it works

This method uses sampling with replacement: from the original dataset of size N, we draw N data points with replacement, which means the same data point can be selected more than once in the same sample.
We repeat this sampling step many times (for example 100, 500, or 1000 times) to create many "bootstrap samples".
Performance is evaluated on the "Out-of-Bag (OOB)" samples, the data points that were not drawn into that particular bootstrap sample and are therefore kept separate.
The results from each bootstrap sample are averaged, or used to build a confidence interval, to assess the model's overall performance.

How it differs from K-Fold and LOOCV:
K-Fold/LOOCV: the data is partitioned into parts, with no overlap between the training set and the test set in each fold.
Bootstrap: sampling with replacement means the training set (the bootstrap sample) may contain duplicated data points, and the test set (OOB) consists of the data points that were never drawn.

Why it's good

Bootstrap is very useful for estimating a confidence interval for a performance metric or for a model parameter, because we get results from many repetitions.

Things to consider

Higher bias than LOOCV: each bootstrap sample does not cover all of the data, so each model is trained on fewer distinct data points.
Lower variance than LOOCV: the OOB test set is larger than 1 point, and repeating the process many times makes the estimate more stable.
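
How much of the data does a bootstrap sample actually cover? The chance that a given point is never picked in N draws is (1 - 1/N)^N, so the expected share of distinct points in the sample is 1 - (1 - 1/N)^N, which is about 63.2% for large N. The sketch below checks this for N = 768 with a quick simulation; the seed and the number of repetitions are arbitrary choices.

```python
import numpy as np

n = 768  # dataset size used in this article

# Expected share of distinct rows in one bootstrap sample
expected_unique = 1 - (1 - 1 / n) ** n

# Simulation: draw n row positions with replacement, count distinct ones
rng = np.random.default_rng(0)
fractions = [
    np.unique(rng.integers(0, n, size=n)).size / n
    for _ in range(1000)
]
simulated_unique = float(np.mean(fractions))

print("Expected share in training sample:", round(expected_unique, 3))  # 0.632
print("Simulated share:", round(simulated_unique, 3))
print("Expected OOB share:", round(1 - expected_unique, 3))             # 0.368
```

Roughly a third of the rows end up out-of-bag in each iteration, which is why the OOB test set is much larger than LOOCV's single point.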

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
import numpy as np
import pandas as pd

# X and y are the same features and target defined in the K-Fold example
model = LogisticRegression(max_iter=200)
n_iterations = 1000  # number of bootstrap iterations
accuracies = []
errors = []

n_samples = len(X)
all_positions = np.arange(n_samples)  # integer positions of all rows

# Run the bootstrap validation
for i in range(n_iterations):
    # 1. Draw row positions with replacement for the training set.
    #    Positions (not index labels) are used so this also works
    #    when dm's index is not the default 0..N-1.
    train_positions = resample(
        all_positions, replace=True, n_samples=n_samples, random_state=i  # use i as the seed for each iteration
    )

    # 2. "Out-of-Bag" (OOB) positions for testing: rows never drawn into this sample
    oob_positions = np.setdiff1d(all_positions, train_positions)

    # Check whether there are any OOB samples
    if len(oob_positions) == 0:
        # If there are none, skip this iteration
        print(f"Warning: No OOB samples for iteration {i}. Skipping.")
        continue

    X_train_bootstrap, y_train_bootstrap = X.iloc[train_positions], y.iloc[train_positions]
    X_test_oob, y_test_oob = X.iloc[oob_positions], y.iloc[oob_positions]

    # 3. Train the model on the bootstrap sample
    model.fit(X_train_bootstrap, y_train_bootstrap)

    # 4. Test on the OOB test set
    predictions = model.predict(X_test_oob)

    fold_accuracy = accuracy_score(y_test_oob, predictions)
    fold_error = 1 - fold_accuracy

    accuracies.append(fold_accuracy)
    errors.append(fold_error)

print(f"Number of Bootstrap iterations: {n_iterations}")
print("Average Accuracy (Bootstrap):", np.round(np.mean(accuracies), 4))
print("Standard Deviation of Accuracy (Bootstrap):", np.round(np.std(accuracies), 4))
print("Average Error (Bootstrap):", np.round(np.mean(errors), 4))
print("Standard Deviation of Error (Bootstrap):", np.round(np.std(errors), 4))

Number of Bootstrap iterations: 1000

Average Accuracy (Bootstrap): 0.7689
Standard Deviation of Accuracy (Bootstrap): 0.0201

Average Error (Bootstrap): 0.2311
Standard Deviation of Error (Bootstrap): 0.0201
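
Since the bootstrap run produces many accuracy values, one simple way to turn them into an interval is the percentile method: take the 2.5th and 97.5th percentiles of the scores for a 95% interval. The sketch below shows the idea. The helper percentile_interval is a hypothetical name, and demo_scores is a synthetic stand-in for the accuracies list, not the values actually produced above.

```python
import numpy as np

def percentile_interval(scores, confidence=0.95):
    """Return the (lower, upper) percentile interval of a list of scores."""
    lower_q = (1 - confidence) / 2 * 100   # e.g. 2.5 for 95%
    upper_q = 100 - lower_q                # e.g. 97.5 for 95%
    lower, upper = np.percentile(scores, [lower_q, upper_q])
    return float(lower), float(upper)

# Synthetic stand-in for the bootstrap accuracies (assumption, not real results)
rng = np.random.default_rng(0)
demo_scores = rng.normal(loc=0.77, scale=0.02, size=1000)

low, high = percentile_interval(demo_scores)
print(f"95% percentile interval: [{low:.4f}, {high:.4f}]")
```

With the real run you would pass the accuracies list instead of demo_scores. The interval describes how the OOB accuracy varies across resamples, so treat it as a rough guide rather than an exact statement about future data.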

Summary: How Cross-Validation Helps

The advantage of cross-validation is that it lowers the risk of overfitting going unnoticed and gives a more accurate estimate of model performance. The model becomes more trustworthy, so we can choose the best model for the job. The downside is that each run can take a long time and a lot of computing resources.

And of course we have to choose the right cross-validation technique, one that fits the characteristics of the dataset and the type of problem. Test data must not leak into the training set. In practice we also have to weigh the computational cost against how fine-grained the evaluation needs to be; LOOCV, for example, takes a long time and can be expensive to run.
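
One common source of leakage is preprocessing that is fitted on the full dataset before splitting, for example scaling with statistics that include the test rows. A minimal sketch of one way to avoid this is to put the preprocessing inside a scikit-learn Pipeline, so it is refitted on the training part of every fold. The data here is a synthetic stand-in, and the use of StandardScaler is an assumption for illustration; the original examples in this article do not scale the features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the dataset used in this article)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# The scaler is part of the model, so each fold fits it on training rows only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

cv = KFold(n_splits=5, shuffle=True, random_state=42)
leak_free_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Accuracies for each fold:", leak_free_scores.round(4))
print("Average Accuracy:", round(leak_free_scores.mean(), 4))
```

The same pattern applies to any step that learns from data, such as imputation or feature selection.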

No Free Lunch means "no single model is the best one and can solve every problem."
If someone asks which model is the best, answer "It depends" (on the data).
The challenge of ML is finding the best model for the problem we are trying to solve. - DataRockie

Thank you.
