Mogakso: Second Session Activities


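Introduction to the Kaggle Competition

[Link] - https://www.kaggle.com/competitions/titanic

Titanic Survival Prediction

Data: various information about the Titanic passengers (e.g., age, sex, port of embarkation)

Goal: train a model on the 891 labeled passengers provided in advance, then predict whether each passenger in the test set survived. The ultimate goal is to steadily raise accuracy, that is, the share of passengers whose survival is predicted correctly.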
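
Since the goal is stated in terms of accuracy, here is a minimal sketch of how that metric is computed: the fraction of passengers whose predicted label matches the true label. The two arrays are made-up toy values, not competition data.

```python
import numpy as np

# Toy labels (hypothetical values, for illustration only)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy = number of correct predictions / number of predictions
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.8 (4 of 5 predictions match)
```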

Evaluation criteria for this session

  1. Data preprocessing method

  2. Model selection method and rationale


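Individual feedback:

  1. The model was chosen simply because its score was the highest. Performance would likely improve if more factors were analyzed before choosing.

  2. Preprocessing paid no attention to the data itself: missing values were simply replaced with the mean or dropped with dropna. Analyze the data types and the data itself, and adopt a more suitable way of handling missing values.

  • Different preprocessing methods were used depending on the data type.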
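
One way to apply different preprocessing per data type, sketched here as an assumption rather than what the study members actually ran, is scikit-learn's ColumnTransformer: numeric columns get median imputation, categorical columns get mode imputation followed by one-hot encoding. The toy frame and the column grouping are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows that mimic a few Titanic columns (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, np.nan, 51.86],
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype="object"),
})

numeric_cols = ["Age", "Fare"]
categorical_cols = ["Embarked"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, which is robust to outliers
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical columns: fill gaps with the mode, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocess.fit_transform(toy)
print(X.shape)  # 2 numeric columns + 2 one-hot columns (C, S)
```

Fitting the transformer on the training data only and then calling transform on the test data keeps the imputation values consistent between the two sets.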


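Most study members had little experience with GitHub, so we used this Mogakso study as an opportunity to practice creating GitHub repositories and writing blog posts. Once we are comfortable with the tools, we plan to create a team repository on GitHub and enter competitions as a team.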


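Photo: start of the activity


Photo: midway through the activity



Photo: end of the activity


Photo: in-person activity


Individual activity records

Individual activity blog URLs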

  • 전장훈: https://jhwannabe.tistory.com/25

  • 신재현: https://hectorsin.github.io/categories/#mgs

  • 곽세현: https://rhkrtpgus.github.io

  • 강성현: https://seong-hyeon-2.github.io

  • 김수진: https://sujin7822.github.io/

Individual Git TIL repositories

  • 전장훈: https://github.com/JHWannabe/TIL

  • 신재현: https://github.com/HectorSin/TIL

  • 곽세현: https://github.com/rhkrtpgus/TIL

  • 강성현: https://github.com/seong-hyeon-2/TIL

  • 김수진: https://github.com/sujin7822/TIL

# Packages for data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# Data visualization packages
import seaborn as sns
import matplotlib.pyplot as plt

# Machine learning packages
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

Loading the data

train_df = pd.read_csv('Data\\train.csv')
test_df = pd.read_csv('Data\\test.csv')
combine = [train_df, test_df]

Data features

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']
  • Categorical data

Survived, Sex, Embarked

  • Ordinal data

Pclass

  • Numerical data
    • Continuous

Age, Fare

    • Discrete

SibSp, Parch
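
The grouping above can be cross-checked programmatically. The sketch below, on a made-up toy frame, lists each column's storage dtype next to its number of distinct values; a numeric column with only a handful of distinct values (such as Survived or Pclass) is usually better treated as categorical or ordinal than as a measurement.

```python
import pandas as pd

# Toy rows that mimic a few Titanic columns (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# dtype shows how pandas stores the column; n_unique hints at whether a
# numeric column is really a category (few distinct values)
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})
print(summary)
```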

# Data preview (first five rows)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
# Last rows of the data
     PassengerId  Survived  Pclass  Name                                      Sex     Age   SibSp  Parch  Ticket      Fare   Cabin  Embarked
886  887          0         2       Montvila, Rev. Juozas                     male    27.0  0      0      211536      13.00  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith              female  19.0  0      0      112053      30.00  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"  female  NaN   1      2      W./C. 6607  23.45  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                     male    26.0  0      0      111369      30.00  C148   C
890  891          0         3       Dooley, Mr. Patrick                       male    32.0  0      0      370376      7.75   NaN    Q

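Data analysis

Cabin has the most missing values, followed by Age (train: Cabin 687, Age 177, Embarked 2; test: Cabin 327, Age 86, Fare 1).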

print('Train data info')
train_df.info()
Train data info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
print('Test data info')
test_df.info()
Test data info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
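
The non-null counts above imply how many values are missing per column. Rather than subtracting by hand, the counts can be tallied directly with isnull().sum(). The sketch below runs on a made-up toy frame; on the real data the same two lines would be applied to train_df and test_df.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic a few Titanic columns (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Count missing values per column and sort so the worst columns come first
missing = toy.isnull().sum().sort_values(ascending=False)
# Keep only the columns that actually have gaps
missing = missing[missing > 0]
print(missing)
```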
# Summary statistics for the training dataset
       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200

Survival rate by Pclass in the train data

train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363

Survival rate by Sex in the train data

train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Sex Survived
0 female 0.742038
1 male 0.188908

Survival rate by SibSp in the train data

train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000

Survival rate by Parch in the train data

train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000
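
The four groupby calls above share one pattern, so they could be wrapped in a small helper. The sketch below is an illustration on a made-up toy frame; survival_rate_by is a hypothetical helper name. It also reports the group size, because a rate such as 0.600000 for Parch = 3 may rest on very few passengers.

```python
import pandas as pd

# Toy rows (values are made up)
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 2],
    "Sex": ["female", "male", "male", "female", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

def survival_rate_by(df, feature):
    """Return the survival rate and group size for each value of feature."""
    return (
        df.groupby(feature)["Survived"]
        .agg(rate="mean", count="size")
        .sort_values("rate", ascending=False)
    )

for feature in ["Pclass", "Sex"]:
    print(survival_rate_by(toy, feature), end="\n\n")
```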

Data cleaning

Removing unnecessary columns

The Ticket and Cabin columns were judged unnecessary for prediction, so they are dropped.

print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)
Before (891, 12) (418, 11) (891, 12) (418, 11)
After (891, 10) (418, 9) (891, 10) (418, 9)
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

   PassengerId  Survived  Pclass  Name                                               Sex  Age   SibSp  Parch  Fare     Embarked
0  1            0         3       Braund, Mr. Owen Harris                            0    22.0  1      0      7.2500   S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  1    38.0  1      0      71.2833  C
2  3            1         3       Heikkinen, Miss. Laina                             1    26.0  0      0      7.9250   S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       1    35.0  1      0      53.1000  S
4  5            0         3       Allen, Mr. William Henry                           0    35.0  0      0      8.0500   S

Filling missing values with the mode

freq_port = train_df.Embarked.dropna().mode()[0]
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.339009
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

   PassengerId  Survived  Pclass  Name                                               Sex  Age   SibSp  Parch  Fare     Embarked
0  1            0         3       Braund, Mr. Owen Harris                            0    22.0  1      0      7.2500   0
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  1    38.0  1      0      71.2833  1
2  3            1         3       Heikkinen, Miss. Laina                             1    26.0  0      0      7.9250   0
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       1    35.0  1      0      53.1000  0
4  5            0         3       Allen, Mr. William Henry                           0    35.0  0      0      8.0500   0
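
Mapping Embarked to 0, 1, 2 can lead linear and distance-based models to treat the ports as ordered and evenly spaced, which they are not. One alternative, shown here only as a sketch on a made-up toy frame, is one-hot encoding with pd.get_dummies.

```python
import pandas as pd

# Toy column (values are made up)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port, so no artificial order S < C < Q is implied
encoded = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)
print(encoded)
```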
for dataset in combine:
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].mean())
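
The loop above fills every missing Age with one overall mean. As a sketch of a more data-aware alternative (an assumption, not the method used in this study), missing ages can be filled with the median of passengers who share the same Pclass and Sex. The toy frame below is made up.

```python
import numpy as np
import pandas as pd

# Toy rows (values are made up); Sex is already encoded as 0/1
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Age":    [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age within each (Pclass, Sex) group, aligned back to every row
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages; fall back to the overall median in case a
# whole group has no known age
toy["Age"] = toy["Age"].fillna(group_median).fillna(toy["Age"].median())
print(toy)
```

On the real data, the group medians would ideally be computed from train_df and then reused for test_df, so both sets are filled with the same values.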


train_df = train_df.drop(['PassengerId', 'Name', 'SibSp'], axis=1)
Survived Pclass Sex Age Parch Fare Embarked
0 0 3 0 22.000000 0 7.2500 0
1 1 1 1 38.000000 0 71.2833 1
2 1 3 1 26.000000 0 7.9250 0
3 1 1 1 35.000000 0 53.1000 0
4 0 3 0 35.000000 0 8.0500 0
5 0 3 0 29.699118 0 8.4583 2
6 0 1 0 54.000000 0 51.8625 0
7 0 3 0 2.000000 1 21.0750 0
8 1 3 1 27.000000 2 11.1333 0
9 1 2 1 14.000000 0 30.0708 1
test_df = test_df.drop(['Name', 'SibSp'], axis=1)
PassengerId Pclass Sex Age Parch Fare Embarked
0 892 3 0 34.5 0 7.8292 2
1 893 3 1 47.0 0 7.0000 0
2 894 2 0 62.0 0 9.6875 2
3 895 3 0 27.0 0 8.6625 0
4 896 3 1 22.0 1 12.2875 0
5 897 3 0 14.0 0 9.2250 0
6 898 3 1 30.0 0 7.6292 2
7 899 2 0 26.0 1 29.0000 0
8 900 3 1 18.0 0 7.2292 1
9 901 3 0 21.0 0 24.1500 0

Model building and prediction

  • Logistic Regression
  • KNN or k-Nearest Neighbors
  • Support Vector Machines
  • Naive Bayes classifier
  • Decision Tree
  • Random Forest
  • Perceptron
  • Artificial neural network
  • RVM or Relevance Vector Machine
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
((891, 6), (891,), (418, 6))
# Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

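Feature analysis from the logistic regression results. Note that the Correlation column below holds the fitted model coefficients (effects on the log-odds of survival), not correlation coefficients.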

coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Correlation', ascending=False)
Feature Correlation
1 Sex 2.550329
5 Embarked 0.295061
4 Fare 0.001232
2 Age -0.034951
3 Parch -0.184376
0 Pclass -1.125116
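
The values in the table above come from logreg.coef_, so they are easier to read after exponentiation: exp(coefficient) is the factor by which the odds of survival change for a one-unit increase in that feature, with the other features held fixed. A minimal sketch using the numbers copied from the table:

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above
coef = pd.Series({
    "Sex": 2.550329, "Embarked": 0.295061, "Fare": 0.001232,
    "Age": -0.034951, "Parch": -0.184376, "Pclass": -1.125116,
})

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in the feature
odds_ratio = np.exp(coef).sort_values(ascending=False)
print(odds_ratio.round(3))
```

A ratio above 1 raises the odds of survival and a ratio below 1 lowers them. Because the features are on different scales (Fare versus Sex, for example), the magnitudes are not directly comparable across features.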
# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
# k-Nearest Neighbors

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

c:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\svm\_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
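
The ConvergenceWarning above often appears when features sit on very different scales (here Fare goes up to 512.3292 while Sex is 0 or 1). A common remedy, sketched below on synthetic data rather than the Titanic set, is to standardize the features in a pipeline and raise max_iter. X_demo and y_demo are hypothetical stand-ins for X_train and Y_train.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for X_train / Y_train (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
# Inflate one feature's scale to imitate a column like Fare
X_demo[:, 0] *= 500

# Standardize features first, then fit; max_iter is raised above the default of 1000
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svc.fit(X_demo, y_demo)
print(round(scaled_svc.score(X_demo, y_demo) * 100, 2))
```

Scaling inside a pipeline also means the test data is transformed with the statistics learned from the training data when predict is called.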
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
# Each score is the accuracy (%) measured on the training data itself
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
   Model                        Score
3  Random Forest                98.09
8  Decision Tree                98.09
1  KNN                          83.39
2  Logistic Regression          79.69
4  Naive Bayes                  78.34
7  Linear SVC                   77.55
6  Stochastic Gradient Descent  74.41
0  Support Vector Machines      68.13
5  Perceptron                   63.97
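
All scores in the table above are computed with score(X_train, Y_train), that is, on the same rows the models were fitted on. This probably explains why Random Forest and Decision Tree reach 98.09: tree models can nearly memorize their training data. A fairer comparison, sketched below on synthetic data (X_demo and y_demo are hypothetical stand-ins for X_train and Y_train), is k-fold cross-validation, where each fold is scored on rows held out from fitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (not the Titanic data);
# flip_y adds label noise so that memorizing the data does not generalize
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, flip_y=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the tree was fitted on
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy over 5 folds, each scored on rows held out from fitting
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(f"train accuracy: {train_acc:.3f}, 5-fold CV accuracy: {cv_acc:.3f}")
```

Ranking the candidate models by cross-validated accuracy instead of training accuracy would address the first feedback item, since the choice then rests on how well each model generalizes.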
# Refit KNN so that Y_pred holds the KNN predictions used for the submission
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('output\\submission.csv', index=False)
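
Before uploading, it can help to sanity-check the submission frame. The helper below is a hypothetical sketch (check_submission is not part of any library); it verifies the two expected columns, the row count, unique passenger IDs, and 0/1 labels. For the real file, expected_rows would be 418, the number of test rows shown earlier.

```python
import pandas as pd

def check_submission(submission, expected_rows):
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(submission.columns) == ["PassengerId", "Survived"]
    assert len(submission) == expected_rows
    assert submission["PassengerId"].is_unique
    assert submission["Survived"].isin([0, 1]).all()
    return True

# Toy frame with made-up predictions
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})
print(check_submission(toy_submission, expected_rows=3))  # True
```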


