𝑁𝑜𝑡𝑒𝑏𝑜𝑜𝑘

Machine Learning 🦾

Chi-Square Test

seoa__ 2025. 2. 3. 21:22

Categorical variables

# Import the required libraries
import pandas as pd
import scipy.stats as stats

# Chi-square tests between 허위매물여부 (the fake-listing flag) and the other categorical variables
categorical_features = ['제공플랫폼', '주차가능여부', '방향', '매물확인방식']

def perform_chi_square_test_with_false_listing(train, feature):
    """Analyze the relationship between 허위매물여부 and a given categorical variable with a chi-square test."""

    # Build the contingency table of 허위매물여부 vs. the given column
    contingency_table = pd.crosstab(train['허위매물여부'], train[feature])

    # Run the chi-square test
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

    print(f"\n📌 Chi-square test result: 허위매물여부 vs {feature}")
    print(f"Chi-square Statistic: {chi2_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Degrees of Freedom: {dof}")

    # Interpret significance (a large p-value means we fail to reject H0; it does not prove H0)
    alpha = 0.05
    if p_value < alpha:
        print(f"✅ Null hypothesis rejected! The association between '{feature}' and 허위매물여부 is significant.")
    else:
        print(f"❌ Failed to reject the null hypothesis! The association between '{feature}' and 허위매물여부 is not significant.")

# 🔹 Run the chi-square test against the fake-listing flag
for feature in categorical_features:
    perform_chi_square_test_with_false_listing(train, feature)
'''
📌 Chi-square test result: 허위매물여부 vs 제공플랫폼
Chi-square Statistic: 19.9206
P-value: 0.0686
Degrees of Freedom: 12
❌ Failed to reject the null hypothesis! The association between '제공플랫폼' and 허위매물여부 is not significant.

📌 Chi-square test result: 허위매물여부 vs 주차가능여부
Chi-square Statistic: 37.9538
P-value: 0.0000
Degrees of Freedom: 1
✅ Null hypothesis rejected! The association between '주차가능여부' and 허위매물여부 is significant.

📌 Chi-square test result: 허위매물여부 vs 방향
Chi-square Statistic: 125.7040
P-value: 0.0000
Degrees of Freedom: 7
✅ Null hypothesis rejected! The association between '방향' and 허위매물여부 is significant.

📌 Chi-square test result: 허위매물여부 vs 매물확인방식
Chi-square Statistic: 5.2510
P-value: 0.0724
Degrees of Freedom: 2
❌ Failed to reject the null hypothesis! The association between '매물확인방식' and 허위매물여부 is not significant.
'''
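
The function above unpacks `expected` from `stats.chi2_contingency` but never looks at it. A common rule of thumb is that the chi-square approximation is trustworthy only when the expected cell counts are roughly 5 or more, which may matter for a feature with many categories such as 제공플랫폼 (12 degrees of freedom). Below is a minimal sketch of such a check. The toy DataFrame `demo`, its column names, and the helper `check_expected_counts` are all hypothetical and not part of the original analysis.

```python
import pandas as pd
import scipy.stats as stats

def check_expected_counts(df, target, feature, min_expected=5):
    """Return (p_value, dof, share of cells whose expected count is below min_expected)."""
    table = pd.crosstab(df[target], df[feature])
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
    # `expected` is an array of expected frequencies under independence
    low_share = (expected < min_expected).mean()
    return p_value, dof, low_share

# Hypothetical toy data: 'C' is a rare category, so its expected counts are small
demo = pd.DataFrame({
    'label':    [0] * 30 + [1] * 30 + [0, 1],
    'platform': ['A'] * 20 + ['B'] * 10 + ['A'] * 10 + ['B'] * 20 + ['C', 'C'],
})

p_value, dof, low_share = check_expected_counts(demo, 'label', 'platform')
print(f"p-value={p_value:.4f}, dof={dof}, cells with expected < 5: {low_share:.0%}")
```

If a large share of cells falls below the threshold, one option is to merge rare categories before testing; whether that is appropriate depends on the data.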

 

Continuous variables

import pandas as pd
import scipy.stats as stats

# 🔹 Convert a continuous variable to a categorical one (automatic quantile split with qcut)
def bin_numerical_feature(df, feature, bins=5):
    """Convert a continuous variable to a categorical one for the chi-square test."""
    # Note: this adds a '<feature>_bin' column to df in place.
    # duplicates='drop' can yield fewer than `bins` bins if quantile edges coincide.
    df[feature + '_bin'] = pd.qcut(df[feature], bins, duplicates='drop')
    return df

# 🔹 Chi-square test against the fake-listing flag
def perform_chi_square_test_with_false_listing(train, feature):
    """Analyze the relationship between 허위매물여부 and a given variable with a chi-square test.

    Non-object (numeric) columns are binned first.
    """

    # Continuous variables are binned and handled under the '_bin' suffix
    if train[feature].dtype != 'object':
        train = bin_numerical_feature(train, feature)
        feature = feature + '_bin'  # use the converted categorical variable

    # Build the contingency table of 허위매물여부 vs. the given column
    contingency_table = pd.crosstab(train['허위매물여부'], train[feature])

    # Run the chi-square test
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

    print(f"\n📌 Chi-square test result: 허위매물여부 vs {feature}")
    print(f"Chi-square Statistic: {chi2_stat:.4f}")
    print(f"P-value: {p_value:.10f}")  # print the p-value with more precision
    print(f"Degrees of Freedom: {dof}")

    # Interpret significance (a large p-value means we fail to reject H0; it does not prove H0)
    alpha = 0.05
    if p_value < alpha:
        print(f"✅ Null hypothesis rejected! The association between '{feature}' and 허위매물여부 is significant.")
    else:
        print(f"❌ Failed to reject the null hypothesis! The association between '{feature}' and 허위매물여부 is not significant.")

# 🔹 Variables to test (continuous variables are converted automatically)
numerical_features = ['보증금', '월세', '전용면적', '총층']

# 🔹 Run the tests
for feature in numerical_features:
    perform_chi_square_test_with_false_listing(train, feature)
'''
📌 Chi-square test result: 허위매물여부 vs 보증금_bin
Chi-square Statistic: 5.1292
P-value: 0.2742944260
Degrees of Freedom: 4
❌ Failed to reject the null hypothesis! The association between '보증금_bin' and 허위매물여부 is not significant.

📌 Chi-square test result: 허위매물여부 vs 월세_bin
Chi-square Statistic: 11.2643
P-value: 0.0237492058
Degrees of Freedom: 4
✅ Null hypothesis rejected! The association between '월세_bin' and 허위매물여부 is significant.

📌 Chi-square test result: 허위매물여부 vs 전용면적_bin
Chi-square Statistic: 13.7645
P-value: 0.0080857787
Degrees of Freedom: 4
✅ Null hypothesis rejected! The association between '전용면적_bin' and 허위매물여부 is significant.

📌 Chi-square test result: 허위매물여부 vs 총층_bin
Chi-square Statistic: 42.4263
P-value: 0.0000000136
Degrees of Freedom: 4
✅ Null hypothesis rejected! The association between '총층_bin' and 허위매물여부 is significant.
'''
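
Every result above reports 4 degrees of freedom. That is what one would expect if 허위매물여부 has two classes and `qcut` produced five bins, since the degrees of freedom of a contingency table are (rows - 1) × (columns - 1) = 1 × 4. Because `train` is not shown in this post, here is a self-contained sketch of the same binning-then-testing flow on synthetic data. The columns `rent` and `is_fake`, the sample size, and the probabilities are all made up for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(0)

# Hypothetical data: 'rent' is continuous, 'is_fake' is a binary label
n = 1000
toy = pd.DataFrame({'rent': rng.normal(50, 10, size=n)})
# Make the label depend on rent so that an association exists by construction
fake_prob = np.where(toy['rent'] > 55, 0.5, 0.1)
toy['is_fake'] = (rng.random(n) < fake_prob).astype(int)

# Quantile binning into 5 groups, then a 2 x 5 contingency table
rent_bin = pd.qcut(toy['rent'], 5, duplicates='drop')
table = pd.crosstab(toy['is_fake'], rent_bin)

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2={chi2_stat:.4f}, p={p_value:.3g}, dof={dof}")
```

Note that the result depends on the number of bins chosen; the original code fixes `bins=5`, and a different choice could change which variables come out as significant.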