

📖 Chapter Introduction

Congratulations on successfully completing your first classification project! Now we take on a new challenge: a regression project that predicts future medical insurance costs from customer characteristics. Where classification was about predicting a 'category', regression is about predicting a 'continuous number'. As always, the first step of a project is to define the problem clearly and to understand the data in depth through **exploratory data analysis (EDA)**. Let's dig in together and see what hidden stories this dataset holds!


🎯 Chapter Goals


๐Ÿ’ป ์ด๋ฒˆ ์ฑ•ํ„ฐ์˜ ์ „์ฒด ์ฝ”๋“œ ๋ฐ ํ”„๋กœ์ ํŠธ ๊ตฌ์กฐ

์ด๋ฒˆ ์ฑ•ํ„ฐ์˜ ํ•ต์‹ฌ ์ฝ”๋“œ

๐Ÿ’ก 10๊ฐ•์—์„œ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ์ด๋ฒˆ ์‹œ๊ฐ„์˜ ๋ชฉํ‘œ๋Š” ๋ชจ๋ธ๋ง์ด ์•„๋‹Œ ๋ฐ์ดํ„ฐ ์ž์ฒด๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ydata-profiling์„ ์‚ฌ์šฉํ•ด ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹์„ ํƒ์ƒ‰ํ•ฉ๋‹ˆ๋‹ค.

# 1. Prepare the libraries
# If ydata-profiling is not installed, install it first:
# !pip install ydata-profiling
from pycaret.datasets import get_data
from ydata_profiling import ProfileReport

# 2. Load the data (the 'insurance' regression example dataset)
insurance_df = get_data('insurance')

# 3. Generate the EDA report
profile = ProfileReport(insurance_df, title="Medical Insurance Cost Data EDA Report")

# 4. View the report (Jupyter Notebook environment)
profile

Preview of the code output

Result of running profile

A ydata-profiling report is generated. Pay particular attention to the distribution of the charges variable.

image.png

Distribution of the charges variable

๋ฆฌํฌํŠธ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, ๋Œ€๋ถ€๋ถ„์˜ ์‚ฌ๋žŒ๋“ค์€ ์˜๋ฃŒ๋น„๊ฐ€ ๋‚ฎ๊ณ (์™ผ์ชฝ์œผ๋กœ ์น˜์šฐ์นจ), ์ผ๋ถ€ ์†Œ์ˆ˜์˜ ์‚ฌ๋žŒ๋“ค๋งŒ ๋งค์šฐ ๋†’์€ ์˜๋ฃŒ๋น„๋ฅผ ์ง€์ถœํ•˜๋Š” ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๊ธด ๊ผฌ๋ฆฌ(right-skewed) ํ˜•ํƒœ๋ฅผ ๋ฑ๋‹ˆ๋‹ค.

image.png
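The right-skewed shape described above can also be checked numerically: in a right-skewed distribution the mean is pulled above the median, and the skewness coefficient is positive. The following is a minimal sketch using only the Python standard library. The data is synthetic (log-normal values generated for illustration, not the real insurance dataset), and `sample_skewness` and `synthetic_charges` are hypothetical names introduced here.

```python
import random
import statistics


def sample_skewness(values):
    """Return the moment-based skewness g1 = m3 / m2**1.5.

    m2 and m3 are the second and third central moments of the sample.
    A positive result indicates a longer tail on the right.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Assumption: synthetic, log-normally distributed values standing in for
# `charges`. These are NOT taken from the insurance dataset.
rng = random.Random(42)
synthetic_charges = [rng.lognormvariate(9.0, 0.9) for _ in range(5000)]

mean_charge = statistics.mean(synthetic_charges)
median_charge = statistics.median(synthetic_charges)
skew = sample_skewness(synthetic_charges)

# For right-skewed data, expect mean > median and skewness > 0.
print(f"mean={mean_charge:.0f}, median={median_charge:.0f}, skewness={skew:.2f}")
```

To try this on the real data, the same function could be called with `insurance_df['charges'].tolist()`. pandas also offers `Series.skew()`, which applies a bias correction, so its value may differ slightly from this simple moment-based estimate.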