밤톨

밤톨's workspace.

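모각소 (Mogakso) Session 2

Activity details

Kaggle competition overview

[Link] - https://www.kaggle.com/competitions/titanic

Titanic survival prediction

Data: various attributes of the Titanic passengers (e.g. age, sex, port of embarkation).

Goal: train a model on the provided data for roughly 800 passengers (891 rows in the training set) and predict whether each passenger in the test set survived. The ultimate aim is to steadily raise accuracy, that is, the share of passengers whose survival is predicted correctly.

Evaluation criteria for this session

Data preprocessing method
Model selection method and rationale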

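Feedback

Individual feedback:

The model was selected simply because its score was the highest. It would be better to analyze several factors and use them to improve performance.
Preprocessing paid no attention to the data itself: missing values were simply replaced with the mean or removed with dropna. Analyze the data types and the data itself, then adopt a more suitable way of handling missing values.

Different preprocessing methods were used depending on the data type.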
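
One way to act on the feedback above is to fill a numeric column from similar rows instead of the global mean. The following is a minimal sketch, not the method used in this post: the `toy` DataFrame and the `fill_age_by_group` helper are hypothetical, and grouping by `Pclass` and `Sex` is only an assumption about which passengers are similar.

```python
import numpy as np
import pandas as pd


def fill_age_by_group(df, group_cols=("Pclass", "Sex")):
    """Return a copy of df with missing Age filled by the median of its group.

    Rows whose whole group has no known Age fall back to the overall median.
    """
    out = df.copy()
    group_median = out.groupby(list(group_cols))["Age"].transform("median")
    overall_median = out["Age"].median()
    out["Age"] = out["Age"].fillna(group_median).fillna(overall_median)
    return out


# Hypothetical toy data, only to show the behaviour
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [38.0, 35.0, np.nan, 22.0, np.nan, 30.0],
})

filled = fill_age_by_group(toy)
print(filled)
```

Because the helper returns a copy, the original frame keeps its missing values, which makes it easy to compare several imputation strategies side by side.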

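GitHub

Most study members had little experience with GitHub, so we used this 모각소 study as an opportunity to practice creating GitHub repositories and writing blog posts. Once we are comfortable with it, we plan to create a team repository on GitHub and enter competitions as a team.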

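Activity photos

Start-of-session photo

[Photo: 2 o'clock activity]

Mid-session photos

[Photo: 3 o'clock activity]

[Photo: 4 o'clock activity]

End-of-session photo

[Photo: 5 o'clock activity]

In-person meeting photo

[Photo: in-person meeting]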

Individual activity records

Individual activity blog URLs

전장훈: https://jhwannabe.tistory.com/25
신재현: https://hectorsin.github.io/categories/#mgs
곽세현: https://rhkrtpgus.github.io
강성현: https://seong-hyeon-2.github.io
김수진: https://sujin7822.github.io/

Individual Git TIL repositories

전장훈: https://github.com/JHWannabe/TIL
신재현: https://github.com/HectorSin/TIL
곽세현: https://github.com/rhkrtpgus/TIL
강성현: https://github.com/seong-hyeon-2/TIL
김수진: https://github.com/sujin7822/TIL

# Packages for data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# Packages for data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Packages for machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

Loading the data

train_df = pd.read_csv('Data/train.csv')
test_df = pd.read_csv('Data/test.csv')
combine = [train_df, test_df]

Data features

print(train_df.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']

Categorical data

Survived, Sex, Embarked

Ordinal data

Pclass

Numerical data
- Continuous

Age, Fare

- Discrete

SibSp, Parch
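
The classification above can also be recorded in the DataFrame itself. This is a small sketch under assumptions: the `toy` rows are hypothetical stand-ins for `train.csv`, and the notebook in this post keeps the columns as plain integers and strings rather than pandas categoricals.

```python
import pandas as pd

# Hypothetical rows with the same column names as train.csv
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Nominal categories: the values have no order
for col in ["Survived", "Sex", "Embarked"]:
    toy[col] = toy[col].astype("category")

# Ordinal category: 1st class ranks above 2nd, which ranks above 3rd
toy["Pclass"] = pd.Categorical(toy["Pclass"], categories=[3, 2, 1], ordered=True)

print(toy.dtypes)
print(toy["Pclass"].min(), toy["Pclass"].max())
```

Declaring `Pclass` as ordered lets comparisons and `min`/`max` follow the class ranking instead of the raw numbers.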

# First rows of the data
train_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

# Last rows of the data
train_df.tail()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.00	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.00	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q

Data analysis

Cabin has the most missing values, followed by Age.

print('Train data info')
train_df.info()

Train data info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

print('Test data info')
test_df.info()

Test data info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
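
The missing counts have to be derived from the `info()` output by subtraction (in the train data, 891 - 204 = 687 for Cabin, 891 - 714 = 177 for Age, 891 - 889 = 2 for Embarked). A ranked table makes the same information easier to read. The helper below is a hypothetical sketch, and the `toy` frame only stands in for `train_df` or `test_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and share of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame; with the real data, pass train_df or test_df instead
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})
print(missing_summary(toy))
```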

# Summary statistics of the training dataset
train_df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Survival rate by Pclass in the train data

train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

	Pclass	Survived
0	1	0.629630
1	2	0.472826
2	3	0.242363

Survival rate by Sex in the train data

train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

	Sex	Survived
0	female	0.742038
1	male	0.188908

Survival rate by SibSp in the train data

train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

	SibSp	Survived
1	1	0.535885
2	2	0.464286
0	0	0.345395
3	3	0.250000
4	4	0.166667
5	5	0.000000
6	8	0.000000

Survival rate by Parch in the train data

train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

	Parch	Survived
3	3	0.600000
1	1	0.550847
2	2	0.500000
0	0	0.343658
5	5	0.200000
4	4	0.000000
6	6	0.000000
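
The group means above do not show how many passengers each rate is based on, so a high rate for a rare value may rest on only a handful of rows. The sketch below reports the group size next to the rate. `survival_by` is a hypothetical helper and `toy` is made-up data, not the competition file.

```python
import pandas as pd


def survival_by(df, feature):
    """Survival rate and passenger count for each value of `feature`."""
    return (
        df.groupby(feature)["Survived"]
        .agg(rate="mean", count="size")
        .sort_values("rate", ascending=False)
        .reset_index()
    )


# Hypothetical toy data; with the real data, call survival_by(train_df, "Parch")
toy = pd.DataFrame({
    "Parch": [0, 0, 0, 0, 1, 1, 3],
    "Survived": [0, 0, 1, 0, 1, 0, 1],
})
result = survival_by(toy, "Parch")
print(result)
```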

Data cleaning

Dropping unneeded columns

The Ticket and Cabin columns were judged unnecessary, so they are dropped.

print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

Before (891, 12) (418, 11) (891, 12) (418, 11)
After (891, 10) (418, 9) (891, 10) (418, 9)

# Encode Sex as an integer (female: 1, male: 0)
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Fare	Embarked
0	1	0	3	Braund, Mr. Owen Harris	0	22.0	1	7.2500	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	71.2833	C
2	3	1	3	Heikkinen, Miss. Laina	1	26.0	0	7.9250	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	53.1000	S
4	5	0	3	Allen, Mr. William Henry	0	35.0	0	8.0500	S

Filling missing values with the mode

freq_port = train_df.Embarked.dropna().mode()[0]
freq_port

'S'

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

	Embarked	Survived
0	C	0.553571
1	Q	0.389610
2	S	0.339009

# Encode Embarked as an integer (S: 0, C: 1, Q: 2)
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Fare	Embarked
0	1	0	3	Braund, Mr. Owen Harris	0	22.0	1	7.2500	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	71.2833	1
2	3	1	3	Heikkinen, Miss. Laina	1	26.0	0	7.9250	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	53.1000	0
4	5	0	3	Allen, Mr. William Henry	0	35.0	0	8.0500	0

# Fill missing Age and Fare with the mean of each dataset
for dataset in combine:
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].mean())
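
The loop above fills each dataset with its own mean, so the train and test sets end up with different fill values. A common alternative is to compute the fill values on the training data only and reuse them for the test data. The following is a sketch under that assumption, with hypothetical `train` and `test` frames; it is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan], "Fare": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 8.0]})

# Statistics come from the training data only
fill_values = train[["Age", "Fare"]].mean()

# fillna with a Series matches fill values to columns by name
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```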

Columns in the original data: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

# Drop the columns that are not used as features
train_df = train_df.drop(['PassengerId', 'Name', 'SibSp'], axis=1)

train_df.head(10)

	Survived	Pclass	Sex	Age	Parch	Fare	Embarked
0	0	3	0	22.000000	0	7.2500	0
1	1	1	1	38.000000	0	71.2833	1
2	1	3	1	26.000000	0	7.9250	0
3	1	1	1	35.000000	0	53.1000	0
4	0	3	0	35.000000	0	8.0500	0
5	0	3	0	29.699118	0	8.4583	2
6	0	1	0	54.000000	0	51.8625	0
7	0	3	0	2.000000	1	21.0750	0
8	1	3	1	27.000000	2	11.1333	0
9	1	2	1	14.000000	0	30.0708	1

# Keep PassengerId in the test set because the submission file needs it
test_df = test_df.drop(['Name', 'SibSp'], axis=1)

test_df.head(10)

	PassengerId	Pclass	Sex	Age	Parch	Fare	Embarked
0	892	3	0	34.5	0	7.8292	2
1	893	3	1	47.0	0	7.0000	0
2	894	2	0	62.0	0	9.6875	2
3	895	3	0	27.0	0	8.6625	0
4	896	3	1	22.0	1	12.2875	0
5	897	3	0	14.0	0	9.2250	0
6	898	3	1	30.0	0	7.6292	2
7	899	2	0	26.0	1	29.0000	0
8	900	3	1	18.0	0	7.2292	1
9	901	3	0	21.0	0	24.1500	0

# Print the number of missing values left in each test column
for column in test_df.columns:
    print(test_df[column].isnull().sum())

Model building and prediction

Logistic Regression
KNN or k-Nearest Neighbors
Support Vector Machines
Naive Bayes classifier
Decision Tree
Random Forest
Perceptron
Artificial neural network
RVM or Relevance Vector Machine

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

((891, 6), (891,), (418, 6))

# Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
print(acc_log)

79.69

Feature analysis from the logistic regression result

coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Correlation', ascending=False)

	Feature	Correlation
1	Sex	2.550329
5	Embarked	0.295061
4	Fare	0.001232
2	Age	-0.034951
3	Parch	-0.184376
0	Pclass	-1.125116
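
Although the column is labelled Correlation, the values are logistic regression coefficients: each one is the change in the log-odds of survival for a one-unit increase in that feature, not a correlation coefficient. Exponentiating a coefficient gives an odds ratio, which is often easier to read. The sketch below copies the coefficients from the table above; the `OddsRatio` name is a label chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above
coef = pd.Series({
    "Sex": 2.550329,
    "Embarked": 0.295061,
    "Fare": 0.001232,
    "Age": -0.034951,
    "Parch": -0.184376,
    "Pclass": -1.125116,
})

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratio = np.exp(coef).rename("OddsRatio")
print(odds_ratio.sort_values(ascending=False))
```

A value above 1 raises the odds of survival and a value below 1 lowers them. Because the features are on different scales (Fare spans hundreds of units, Sex only 0 and 1), the sizes of the coefficients are not directly comparable as a measure of importance.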

# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print(acc_svc)

68.13

# k-Nearest Neighbors

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
print(acc_knn)

83.39

# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
print(acc_gaussian)

78.34

# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
print(acc_perceptron)

63.97

# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
print(acc_linear_svc)

77.55


c:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\svm\_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(
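
The ConvergenceWarning above means the liblinear solver stopped before reaching its tolerance. Two usual remedies are to put the features on a comparable scale and to raise `max_iter`. The sketch below shows both on synthetic data; the dataset from `make_classification`, the scale factor of 500, and `max_iter=10000` are all assumptions for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for X_train / Y_train
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[:, 0] *= 500  # mimic one feature on a much larger scale than the others

# Standardize the features, then fit LinearSVC with a larger iteration budget
model = make_pipeline(
    StandardScaler(),
    LinearSVC(max_iter=10000, random_state=0),
)
model.fit(X, y)
print(round(model.score(X, y) * 100, 2))
```

Wrapping the scaler and the classifier in one pipeline ensures the scaling learned on the training data is applied unchanged when predicting on the test data.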

# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
print(acc_sgd)

74.41

# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
print(acc_decision_tree)

98.09

# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(acc_random_forest)

98.09
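
Every accuracy in this section is computed on the same rows the model was trained on, so a score near 100 for a tree-based model mostly shows that it memorized the training set. Cross-validation gives an estimate on rows the model has not seen. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `Y_train`; the noise level and the 5 folds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train, with some label noise
X, y = make_classification(
    n_samples=400, n_features=6, flip_y=0.2, random_state=0
)

# Accuracy on the training rows themselves
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)

# Accuracy on held-out folds: each fold is scored by a model that never saw it
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("training accuracy:", round(train_acc * 100, 2))
print("5-fold CV accuracy:", round(cv_scores.mean() * 100, 2))
```

With the real data, replacing `X, y` by `X_train, Y_train` and looping over the models would give scores that are more suitable for choosing between them than the training accuracy.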

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

	Model	Score
3	Random Forest	98.09
8	Decision Tree	98.09
1	KNN	83.39
2	Logistic Regression	79.69
4	Naive Bayes	78.34
7	Linear SVC	77.55
6	Stochastic Gradient Descent	74.41
0	Support Vector Machines	68.13
5	Perceptron	63.97

# Re-run KNN so that Y_pred holds the KNN predictions used for the submission
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
print(acc_knn)

83.39

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })

submission.to_csv('output/submission.csv', index=False)
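
`to_csv` raises an error if the target directory does not exist, so it can help to create it before writing. The helper below is a hypothetical sketch: `save_submission` is not part of this notebook, and the example writes to a temporary directory with made-up predictions instead of the real `Y_pred`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_submission(passenger_ids, predictions, path):
    """Write a two-column submission file, creating parent folders if needed."""
    submission = pd.DataFrame({
        "PassengerId": passenger_ids,
        "Survived": predictions,
    })
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    submission.to_csv(path, index=False)
    return submission


# Example with made-up predictions, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output" / "submission.csv"
    save_submission([892, 893, 894], [0, 1, 0], target)
    reloaded = pd.read_csv(target)

print(reloaded)
```

With the notebook's variables, the call would be `save_submission(test_df["PassengerId"], Y_pred, "output/submission.csv")`.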
