

📖 Chapter Introduction

If Lecture 4 was where we drew the 'blueprint' for our experiment, this session covers how to build a state-of-the-art 'automated plant' from that blueprint. Real-world data is rarely clean: missing values, outliers, and similar problems are the norm. We will use the preprocessing features built into setup() to handle these problems and build a data pipeline that gets the most out of the model. Your analysis is about to become one step more refined!


🎯 Chapter Goals


💻 Full Code and Project Structure for This Chapter

Core code for this chapter

💡 We add advanced preprocessing parameters to the setup() code from Lecture 4 to build a more refined experiment environment. The key point of this lecture is checking the result with get_config()!

# 1. Import the libraries
from pycaret.datasets import get_data
from pycaret.regression import setup, get_config

# 2. Load the data
insurance_df = get_data('insurance')

# 3. Set up the experiment environment with advanced preprocessing
# First we check the result with only basic preprocessing applied,
# then we look at the effect of advanced parameters such as normalize and transformation.
pro_reg_experiment = setup(
    data = insurance_df,
    target = 'charges',
    session_id = 123,

    # --- Parameters added or changed in Lecture 5 ---
    numeric_imputation = 'mean',     # fill missing numeric values with the column mean
    ignore_features = ['region'],    # exclude the region column from the analysis

    # The parameters below are enabled separately, after we cover the concepts, so we can examine their effect.
    # normalize = True,
    # transformation = True,
    # remove_outliers = True,
)

# 4. Inspect the transformed training data
transformed_train_df = get_config('X_train_transformed')
print("--- Original data ---")
print(insurance_df.head())
print("\n--- Transformed data (basic preprocessing) ---")
print(transformed_train_df.head())

Preview of the code output (basic preprocessing)

Checking the original data (insurance_df.head())

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Information table shown after running setup() (basic preprocessing)

Numeric imputation is set to mean, and none of the other advanced features have been applied yet.
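
To make the idea behind mean imputation concrete, here is a minimal pure-Python sketch. It is not PyCaret's implementation. The helper names `fit_mean` and `fill_missing` and the toy `bmi` values are hypothetical and exist only for illustration. The sketch assumes the usual practice of learning the fill value from the training rows and reusing it on unseen rows.

```python
from statistics import mean


def fit_mean(values):
    """Return the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def fill_missing(values, fill_value):
    """Replace every None in the list with the given fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical toy column with gaps (made-up numbers, not the real dataset).
train_bmi = [20.0, None, 30.0, 40.0]
test_bmi = [None, 28.0]

# Assumption: the fill value is learned from the training rows only...
bmi_mean = fit_mean(train_bmi)

# ...and the same value is then applied to both training and unseen rows.
train_filled = fill_missing(train_bmi, bmi_mean)
test_filled = fill_missing(test_bmi, bmi_mean)

print(train_filled)
print(test_filled)
```

Reusing the training mean on the test rows keeps information from unseen data out of the preprocessing step, which is the main reason to let a pipeline handle imputation rather than filling the whole table by hand beforehand.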