Machine Learning Sampling

seoa__ 2025. 1. 31. 21:10

    Sampling

    : In machine learning, sampling techniques rebalance the training data to address class imbalance, where one class has far fewer examples than another.
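Before picking a technique, it helps to quantify how skewed the labels actually are. The snippet below is a minimal sketch, assuming the labels are available as a plain Python list; the `labels` data and the `imbalance_ratio` helper are hypothetical names used only for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the per-class counts and the majority/minority count ratio."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Hypothetical toy labels: 0 = normal sample, 1 = rare sample.
labels = [0] * 95 + [1] * 5
counts, ratio = imbalance_ratio(labels)
print(counts)  # number of samples per class
print(ratio)   # majority samples per minority sample
```

A ratio close to 1 means the classes are already roughly balanced; the larger the ratio, the more likely resampling is worth trying.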

     

    Types

    1️⃣ Oversampling

    2️⃣ Undersampling

    3️⃣ Combined sampling (Over + Under)

     

    1️⃣ Oversampling

    : A technique that balances the classes by increasing the amount of minority-class data

    (e.g., when fake listings are rare, fake-listing samples are generated artificially)

     

    Main techniques

    • Random OverSampling
      • Simply duplicates minority-class samples
        • Pro: easy and fast
        • Con: risk of overfitting because of the duplicated data
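    • SMOTE (Synthetic Minority Over-sampling Technique)
      • Generates new minority samples by interpolating between a sample and its k nearest minority-class neighbors (KNN)
        • Pro: increases data diversity
        • Con: synthetic points created near outliers or overlapping classes can add noise and still lead to overfitting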
    • ADASYN (Adaptive Synthetic Sampling)
      • A SMOTE variant that adapts the sampling ratio per sample, generating more synthetic data around minority samples that are harder to learn (those with more majority-class neighbors)
        • Pro: can improve learning performance
        • Con: risk of adding noise
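The first technique in the list above, random oversampling, is simple enough to sketch without any library. This is a minimal illustration, assuming features and labels are plain Python lists; `random_oversample` and the toy data are hypothetical names, not part of the original post.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen samples of each smaller class
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the original samples that belong to this class.
        pool = [i for i, lbl in enumerate(y) if lbl == label]
        for _ in range(target - count):
            i = rng.choice(pool)  # sampling with replacement
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 6 majority samples (0) and 2 minority samples (1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Every added row is an exact copy of an existing one, which is why the overfitting risk noted above applies. In practice, imbalanced-learn offers `RandomOverSampler` in `imblearn.over_sampling` for the same purpose.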
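    📌 Oversampling code example (ADASYN)
    
    from imblearn.over_sampling import ADASYN
    
    # X_train, y_train: training features and labels, defined earlier (not shown here)
    # Resample only the training split so the evaluation data keeps its original distribution
    adasyn = ADASYN(random_state=42)
    X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train, y_train)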
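To make the neighbor-based generation behind SMOTE more concrete, here is a simplified, library-free sketch of the interpolation step. It assumes the minority-class points are lists of floats; `smote_like` is a hypothetical helper that only illustrates the idea and is not the imbalanced-learn implementation.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points, each lying on the segment between
    a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points in 2D.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_like(minority, n_new=4)
for point in new_points:
    print(point)
```

Because each new point is interpolated rather than copied, the synthetic data is more varied than with plain duplication. ADASYN builds on the same step but decides how many points to generate around each sample based on how many majority-class neighbors it has.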

    2️⃣ Undersampling

    : A technique that balances the classes by reducing the amount of majority-class data

    (e.g., when there are too many normal listings, some of them are removed)

    Main techniques

    • Random UnderSampling
      • Removes majority-class samples at random
        • Pro: simple and fast
        • Con: important data may be lost
    • Tomek Links
      • Finds Tomek links (pairs of mutual nearest neighbors that belong to different classes) and removes the majority-class sample of each pair, cleaning up the class boundary
        • Pro: can improve model performance
        • Con: possible data loss
    • NearMiss
      • Keeps only the majority-class samples closest to the minority class and removes the rest
        • Pro: reduces the amount of training data while often maintaining performance
        • Con: possible data loss and lack of information
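Random undersampling, the first technique in the list above, can be sketched in a few lines. This is a minimal illustration, assuming features and labels are plain Python lists; `random_undersample` and the toy data are hypothetical names.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=42):
    """Randomly drop samples from the larger classes
    until every class is as small as the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        pool = [i for i, lbl in enumerate(y) if lbl == label]
        keep.extend(rng.sample(pool, target))  # sampling without replacement
    keep.sort()  # preserve the original order of the kept rows
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 6 majority samples (0) and 2 minority samples (1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

Which majority rows survive depends entirely on the random seed, which is where the risk of losing important data comes from. imbalanced-learn provides `RandomUnderSampler` in `imblearn.under_sampling` for real use.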
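    📌 Undersampling code example (NearMiss)
    
    from imblearn.under_sampling import NearMiss
    
    # Default settings; the `version` parameter (1, 2, or 3) selects the NearMiss variant
    # X_train, y_train: training features and labels, defined earlier (not shown here)
    nearmiss = NearMiss()
    X_train_resampled, y_train_resampled = nearmiss.fit_resample(X_train, y_train)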
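The Tomek Links idea from the list above is easier to follow with a small worked example. The sketch below uses a brute-force neighbor search on plain Python lists; `nearest_index`, `tomek_links`, and `remove_majority_in_links` are hypothetical helpers written for illustration, not the imbalanced-learn API.

```python
import math


def nearest_index(X, i):
    """Index of the nearest neighbour of X[i], excluding X[i] itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) with i < j that are mutual nearest neighbours
    and carry different labels."""
    links = []
    for i in range(len(X)):
        j = nearest_index(X, i)
        if i < j and y[i] != y[j] and nearest_index(X, j) == i:
            links.append((i, j))
    return links


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {idx for pair in tomek_links(X, y) for idx in pair
            if y[idx] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical 1D toy data: the majority point at 1.9 sits right next to
# the minority point at 2.0, so the two form a Tomek link.
X = [[0.0], [0.5], [1.0], [1.9], [2.0], [2.6]]
y = [0, 0, 0, 0, 1, 1]
links = tomek_links(X, y)
X_res, y_res = remove_majority_in_links(X, y, majority_label=0)
print(links)  # [(3, 4)]
print(y_res)  # [0, 0, 0, 1, 1]
```

Note that only one majority sample is removed here: Tomek Links mainly cleans the class boundary and usually leaves the classes far from balanced. The library version is `TomekLinks` in `imblearn.under_sampling`.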

    3️⃣ Combined sampling (Over + Under)

    : A method that balances the classes by applying oversampling and undersampling together

    (e.g., using SMOTE + Tomek Links)

    Main techniques

    • SMOTE + Tomek Links
      • Grows the minority class with SMOTE, then cleans up the majority class with Tomek Links
        • Pro: balanced data + less noise
        • Con: higher computational cost
    • SMOTE + Edited Nearest Neighbors (ENN)
      • Applies SMOTE, then removes noise with ENN, which drops samples whose class disagrees with most of their nearest neighbors
        • Pro: can improve performance
        • Con: risk of data loss
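    📌 Combined sampling code example (SMOTE + Tomek)
    
    from imblearn.combine import SMOTETomek
    
    # Oversamples with SMOTE, then removes Tomek links from the result
    # X_train, y_train: training features and labels, defined earlier (not shown here)
    smote_tomek = SMOTETomek(random_state=42)
    X_train_resampled, y_train_resampled = smote_tomek.fit_resample(X_train, y_train)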
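The ENN cleaning step used in SMOTE + ENN can be sketched in the same library-free style. This is a simplified illustration, assuming plain Python lists and a brute-force neighbor search; `enn_clean` and the toy data are hypothetical, and the data merely stands in for the output of an oversampling step.

```python
import math
from collections import Counter


def enn_clean(X, y, k=3):
    """Keep only samples whose label agrees with the majority vote
    of their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        votes = Counter(y[j] for j in nearest)
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical 1D data: the sample at 0.4 is labelled 1
# but is surrounded by class-0 samples, so ENN treats it as noise.
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [3.0], [3.2], [3.4], [3.6]]
y = [0, 0, 1, 0, 0, 1, 1, 1, 1]
X_clean, y_clean = enn_clean(X, y)
print(y_clean)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

ENN can remove many more samples than Tomek Links, which is the data-loss risk mentioned above. imbalanced-learn bundles the two steps as `SMOTEENN` in `imblearn.combine`.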

     

    🙆🏻‍♀️ Conclusion

    1️⃣ Oversampling

    • SMOTE or ADASYN recommended
    • Augments the data without discarding any samples

    2️⃣ Undersampling

    • Random undersampling or Tomek Links recommended
    • Speeds up computation while improving class balance

    3️⃣ Combined sampling (Over + Under)

    • SMOTE + Tomek Links recommended
    • Keeps the classes balanced + reduces noise