
Kaggle | Titanic - Machine Learning from Disaster (Titanic Survival Prediction) | Summary of a Baseline and Notable Notebooks

Date: 2018-11-08 08:24:35

Contents

1. Data Overview
2. Code
3. Directions for Improving the Code

1. Data Overview

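Titanic - Machine Learning from Disaster is a competition aimed mainly at machine learning beginners. The data is structured (tabular) and its format is simple. The dataset is small (891 training rows and 418 test rows), so even when you find effective features and reach good accuracy, a small change can easily make the accuracy drop. In fact, notebooks with high Public Leaderboard scores do not necessarily predict unseen data well; they may simply have been tuned against the test data until the results happened to match. In the public Titanic dataset, each passenger has the following features: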

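- Survived: whether the passenger survived (the label)
- PassengerId: passenger ID
- Pclass: ticket class; 1 = 1st class (upper), 2 = 2nd class (middle), 3 = 3rd class (lower)
- Name: passenger name
- Sex: gender
- Age: age
- SibSp: number of siblings and spouses traveling with the passenger aboard the Titanic
- Parch: number of parents and children traveling with the passenger aboard the Titanic
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin
- Embarked: port where the passenger boarded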
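As a quick orientation, the sketch below builds a three-row frame with the same columns and checks which columns contain missing values and which are numeric. The rows are made up for illustration (they are not taken from train.csv); in the real training file the columns with missing values are Age, Cabin, and Embarked.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the layout of train.csv (illustrative values).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, np.nan],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["A 1001", "B 2002", "C 3003"],
    "Fare": [7.25, 71.28, 7.93],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
})

# Count missing values per column and keep only the columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)

# Numeric and non-numeric columns need different preprocessing later on.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(categorical_cols)
```

Running the same two checks on the real files is a cheap way to decide which columns need imputation and which need encoding.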

2. Code

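The implementation consists of the following basic steps:

- Feature processing
- Model building
- Hyperparameter tuning
- Model ensembling (blending)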

#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@author: liujie
@file: titanic.py
@time: /09/08
@desc: Kaggle case study: Titanic
"""
import itertools

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from xgboost import XGBClassifier

# NOTE: the value of every random_state in the original listing was lost
# (it read "random_state="). SEED is a placeholder assumption, so the scores
# quoted in the comments below may not be reproduced exactly.
SEED = 0

# TODO:1. Build the training and test sets
train = pd.read_csv("data/train.csv", sep=",", header=0)
test = pd.read_csv("data/test.csv", sep=",", header=0)

x_train = train.drop(['Survived'], axis=1)
y_train = train['Survived']
x_test = test.copy()

# TODO:2. Build features
# Drop PassengerId, Name, Ticket, Cabin
x_train = x_train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
x_test = x_test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# Label-encode Sex and Embarked (missing values become the category "NA")
for c in ["Sex", "Embarked"]:
    le = LabelEncoder()
    le.fit(x_train[c].fillna("NA"))
    x_train[c] = le.transform(x_train[c].fillna("NA"))
    x_test[c] = le.transform(x_test[c].fillna("NA"))

# TODO:3. Build the model
xgb = XGBClassifier(n_estimators=20, random_state=SEED)
xgb.fit(x_train, y_train)  # train
pred = xgb.predict_proba(x_test)[:, 1]  # predict
# Convert probabilities to 0/1 labels
pred = np.where(pred > 0.5, 1, 0)

# TODO:4. K-fold cross-validation
# Lists holding the accuracy and logloss of each fold
scores_accuracy = []
scores_logloss = []

kf = KFold(n_splits=4, shuffle=True, random_state=SEED)
for tr_idx, va_idx in kf.split(x_train):
    # Split into training and validation parts
    tr_x, va_x = x_train.iloc[tr_idx], x_train.iloc[va_idx]
    tr_y, va_y = y_train.iloc[tr_idx], y_train.iloc[va_idx]

    # Build the XGBoost model
    xgb = XGBClassifier(n_estimators=20, random_state=SEED)
    xgb.fit(tr_x, tr_y)

    # Predict on the validation part
    va_pred = xgb.predict_proba(va_x)[:, 1]

    # Evaluate logloss and accuracy
    logloss = log_loss(va_y, va_pred)
    acc = accuracy_score(va_y, va_pred > 0.5)
    scores_accuracy.append(acc)
    scores_logloss.append(logloss)

# Print the mean of each metric over the folds
logloss = np.mean(scores_logloss)
accuracy = np.mean(scores_accuracy)
print(f'logloss: {logloss:.4f}, accuracy: {accuracy:.4f}')
# Output reported by the author: logloss: 0.4300, accuracy: 0.8137

# TODO:5. Tune hyperparameters
# Hyperparameter values to try
param_space = {"max_depth": [3, 5, 7], "min_child_weight": [1.0, 2.0, 4.0]}

# Generate every combination of the hyperparameters
param_combinations = itertools.product(param_space["max_depth"], param_space["min_child_weight"])

# Lists holding each parameter combination and its mean logloss
params = []
scores = []

for max_depth, min_child_weight in param_combinations:
    # Scores of the individual folds
    scores_fold = []
    kf = KFold(n_splits=4, shuffle=True, random_state=SEED)
    for tr_idx, va_idx in kf.split(x_train):
        # Split into training and validation parts
        tr_x, va_x = x_train.iloc[tr_idx], x_train.iloc[va_idx]
        tr_y, va_y = y_train.iloc[tr_idx], y_train.iloc[va_idx]

        # Build the XGBoost model
        xgb = XGBClassifier(n_estimators=20, max_depth=max_depth,
                            min_child_weight=min_child_weight, random_state=SEED)
        xgb.fit(tr_x, tr_y)

        # Predict on the validation part and evaluate logloss
        va_pred = xgb.predict_proba(va_x)[:, 1]
        logloss = log_loss(va_y, va_pred)
        scores_fold.append(logloss)

    score_mean = np.mean(scores_fold)
    params.append((max_depth, min_child_weight))
    scores.append(score_mean)

# Pick the combination with the lowest mean logloss
best_idx = np.argsort(scores)[0]
best_param = params[best_idx]
print(f'best_param={best_param},best_score={scores[best_idx]}')
# Output reported by the author: best_param=(7, 2.0),best_score=0.4212539335124341

# TODO:6. Build the features for the logistic regression model used in the ensemble
x2_train = train.drop(['Survived'], axis=1)
y2_train = train['Survived']
x2_test = test.copy()

# Drop PassengerId, Name, Ticket, Cabin from the training and test data
x2_train = x2_train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
x2_test = x2_test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# One-hot encode the categorical features
cat_cols = ['Sex', 'Embarked', 'Pclass']
# sparse_output replaces the older "sparse" argument (scikit-learn >= 1.2);
# on older versions use sparse=False instead.
ohe = OneHotEncoder(categories='auto', sparse_output=False)
ohe.fit(x2_train[cat_cols].fillna('NA'))

# Build the column names of the one-hot encoded features
ohe_columns = []
for i, c in enumerate(cat_cols):
    # .categories_ lists the category values found for each feature
    ohe_columns += [f'{c}_{v}' for v in ohe.categories_[i]]

# Store the one-hot encoded result in DataFrames
ohe_train_x2 = pd.DataFrame(ohe.transform(x2_train[cat_cols].fillna('NA')), columns=ohe_columns)
ohe_test_x2 = pd.DataFrame(ohe.transform(x2_test[cat_cols].fillna('NA')), columns=ohe_columns)

# Drop the original columns that have been one-hot encoded
x2_train = x2_train.drop(cat_cols, axis=1)
x2_test = x2_test.drop(cat_cols, axis=1)

# Concatenate the one-hot encoded columns with the remaining data
x2_train = pd.concat([x2_train, ohe_train_x2], axis=1)
x2_test = pd.concat([x2_test, ohe_test_x2], axis=1)

# Fill missing numeric values with the training-set mean
# (assignment instead of inplace=True, which is unreliable on recent pandas)
num_cols = ['Age', 'SibSp', 'Parch', 'Fare']
for col in num_cols:
    col_mean = x2_train[col].mean()
    x2_train[col] = x2_train[col].fillna(col_mean)
    x2_test[col] = x2_test[col].fillna(col_mean)

# Log-transform Fare to reduce its right skew (closer to a normal distribution)
x2_train['Fare'] = np.log1p(x2_train['Fare'])
x2_test['Fare'] = np.log1p(x2_test['Fare'])

# TODO:7. Model ensembling
xgb_model = XGBClassifier(n_estimators=20, max_depth=7, min_child_weight=2.0, random_state=SEED)
xgb_model.fit(x_train, y_train)
xgb_pred = xgb_model.predict_proba(x_test)[:, 1]

lr_model = LogisticRegression(solver='lbfgs', max_iter=300)
lr_model.fit(x2_train, y2_train)
lr_pred = lr_model.predict_proba(x2_test)[:, 1]

# Weighted average of the two models' predicted probabilities
pred = xgb_pred * 0.8 + lr_pred * 0.2
label_pred = np.where(pred > 0.5, 1, 0)

# Save the predictions
submission = pd.DataFrame({"PassengerId": test['PassengerId'], "Survived": label_pred})
submission.to_csv("submission.csv", index=False)
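The listing fixes the blend weights at 0.8 and 0.2 without validating them. One possible way to choose the weight is to compare candidate weights on out-of-fold predictions, as in the sketch below. It is only an illustration under stated assumptions: the data is synthetic (make_classification), GradientBoostingClassifier stands in for XGBClassifier to keep the example dependency-light, and both models share one feature matrix, whereas the listing uses separate feature sets for the two models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (assumption); replace with the Titanic features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Out-of-fold probabilities: every row is predicted by a model that never saw it.
gbm_oof = cross_val_predict(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    X, y, cv=cv, method="predict_proba",
)[:, 1]
lr_oof = cross_val_predict(
    LogisticRegression(max_iter=300),
    X, y, cv=cv, method="predict_proba",
)[:, 1]

# Evaluate the blend w * gbm + (1 - w) * lr for a grid of weights.
weights = np.linspace(0.0, 1.0, 11)
losses = [log_loss(y, w * gbm_oof + (1 - w) * lr_oof) for w in weights]
best_w = float(weights[int(np.argmin(losses))])

print(f"best weight for the boosted model: {best_w:.1f}")
print(f"logloss at that weight: {min(losses):.4f}")
```

With only a few hundred rows the loss curve over the weights is usually flat near its minimum, so the selected weight should be read as a rough guide rather than a precise optimum.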

3. Directions for Improving the Code

How am I doing with my score? This notebook collects the techniques that participants used and the scores they obtained. It is a useful reference.
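To put the scores quoted in this section into perspective, the sketch below converts each one into a number of correctly classified passengers. It assumes the score is plain accuracy over all 418 rows of test.csv; that is an assumption, although every score listed here is consistent with it (each equals a whole number divided by 418, to five decimals).

```python
# Assumption: the score is accuracy computed over all 418 rows of test.csv.
N_TEST = 418

# Leaderboard scores of the notebooks discussed in this section.
scores = [0.77990, 0.78468, 0.78947, 0.79904, 0.80861, 0.81339, 0.82775, 0.83253]


def correct_predictions(score: float, n_rows: int = N_TEST) -> int:
    """Return the number of correctly classified rows implied by an accuracy."""
    return round(score * n_rows)


counts = [correct_predictions(s) for s in scores]
for s, c in zip(scores, counts):
    print(f"{s:.5f} -> {c}/{N_TEST} correct")

# Difference between the highest and the lowest score, in passengers.
spread = max(counts) - min(counts)
print(f"spread: {spread} passengers")
```

Under that assumption the entire range, from the three-feature LightGBM model to the best kNN notebook, corresponds to 22 passengers, which illustrates how easily a small change can move a score on data this small.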

0.77990 Gender + Class + Embarked LightGBM in Python.

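This notebook uses only three features (Sex, Embarked, and Pclass) for prediction; the model is LightGBM.
- Fill the missing values in Embarked
- Label-encode Sex and Embarked
- Run a grid search with 10-fold cross-validation over the LightGBM model to find the best model and produce the predictions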

0.78468 Name-only text vectorization and PCA with a 3D interactive plot.

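This notebook uses only the Name feature for prediction; the model is KNeighborsClassifier.
- Vectorize the Name feature with word counts followed by PCA, fit a KNeighborsClassifier tuned with GridSearchCV, and produce the predictions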
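A minimal sketch of that pipeline is shown below. The names and labels are invented for illustration, and the number of PCA components and the parameter grid are assumptions rather than the notebook's settings. The word counts are converted to a dense array before PCA through a small helper (to_dense, a hypothetical name).

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Invented names and labels (assumption), formatted like the Name column.
names = [
    "Doe, Mr. John", "Roe, Mr. Sam", "Poe, Mr. Tom",
    "Low, Mr. Max", "Kay, Mr. Ben", "Ray, Mr. Dan",
    "Doe, Mrs. Jane", "Roe, Miss. Ann", "Poe, Mrs. Eve",
    "Low, Miss. Amy", "Kay, Mrs. Sue", "Ray, Miss. Ida",
]
survived = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


def to_dense(matrix):
    """Convert the sparse word-count matrix to a dense array for PCA."""
    return matrix.toarray()


pipeline = make_pipeline(
    CountVectorizer(),                                # word counts per name
    FunctionTransformer(to_dense, accept_sparse=True),
    PCA(n_components=3, random_state=0),              # compress the counts
    KNeighborsClassifier(),
)

# Tune the number of neighbours with cross-validated grid search.
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3]},
    cv=3,
)
search.fit(names, survived)

pred = search.predict(["Moe, Mrs. Kate"])
print(search.best_params_, pred)
```

Most of the signal in such a model comes from the title words (Mr, Mrs, Miss), which encode gender and, loosely, age.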

0.78947 Gender + Class + Embarked + Age using SVM

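This notebook uses only four features (Sex, Embarked, Pclass, and Age) for prediction; the model is SVM.
- Load the data and handle the missing values in Embarked
- For Age, keep only the information about young passengers, and create an indicator flag for passengers whose Age is missing
- One-hot encode Embarked and Pclass; label-encode Sex
- Fit an SVC model tuned with GridSearchCV and produce the predictions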
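The feature preparation described above might look like the sketch below. The rows are invented, and the age threshold that defines "young" (YOUNG_AGE) is a hypothetical value, since the summary does not state the one the notebook uses.

```python
import numpy as np
import pandas as pd

# Invented rows (assumption) with the four features used by this notebook.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
    "Pclass": [3, 1, 2, 3],
    "Age": [22.0, np.nan, 4.0, 35.0],
})

# Fill missing Embarked values with the most frequent port.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Flag rows whose Age is missing, so the model can use that fact directly.
df["AgeMissing"] = df["Age"].isna().astype(int)

# Reduce Age to a "young passenger" flag; NaN compares as False here.
YOUNG_AGE = 16  # hypothetical threshold
df["IsYoung"] = (df["Age"] < YOUNG_AGE).astype(int)

# Label-encode Sex; one-hot encode Embarked and Pclass.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
features = pd.get_dummies(
    df.drop(columns="Age"), columns=["Embarked", "Pclass"], dtype=int
)
print(features)
```

The resulting frame contains only 0/1 columns, which suits an SVM because no feature dominates the distance computation by scale.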

0.79904 Neural network (keras) by Rafael Vicente Leite.

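This notebook uses an ANN model.
- Load the data
- Extract the gender-indicating title from Name
- Drop the irrelevant features 'Ticket' and 'Cabin'
- Fill the missing values in the 'Embarked' column
- Label-encode the title extracted from Name
- Label-encode Sex and Embarked
- Compute the mean fare and mean age for each combination of Sex and Pclass, and use them to fill missing values
- One-hot encode the 'Embarked', 'Name', and 'Pclass' features
- Standardize all features
- Build the ANN model
- Predict and evaluate
- Generate the submission file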
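Two of these steps, extracting the title from Name and filling missing Age values with the mean of the matching Sex and Pclass group, can be sketched in pandas as follows. The rows are invented, and the regular expression is one possible way to capture the text between the comma and the first period, not necessarily the notebook's.

```python
import numpy as np
import pandas as pd

# Invented rows (assumption) in the format of the Name column.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann",
             "Poe, Mr. Sam", "Low, Miss. Eve"],
    "Sex": ["male", "female", "female", "male", "female"],
    "Pclass": [3, 1, 3, 3, 3],
    "Age": [30.0, 40.0, 20.0, np.nan, np.nan],
})

# The title sits between the comma and the first period: "Doe, Mr. John" -> "Mr".
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mean Age of each (Sex, Pclass) group, aligned row by row with the frame.
group_mean_age = df.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# Fill only the missing ages with the mean of the passenger's own group.
df["Age"] = df["Age"].fillna(group_mean_age)
print(df[["Title", "Sex", "Pclass", "Age"]])
```

A group with no known ages at all would still yield NaN, so a global fallback (for example the overall mean) may be needed on real data.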

0.79904 Using ethnicity feature by Frederik Hasecke.

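This notebook uses an ethnicity feature derived from Name and a blend of several models.
- Data preparation
  - Fill the missing values in Embarked, Fare, and Age; the rule for filling Age is fairly complex
  - Infer ethnicity from the Name feature
  - Label-encode Sex, the processed Name marker, Embarked, Fare, Ethnicity, and Age
- Model training
  - Blend of Gauss, KNN, Logistic Regression, Random Forest, and SVM models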

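0.80861 Simple stacking by Anisotropic. Stacking is ubiquitous in competitions.

This notebook uses the stacking ensemble method, which is prone to overfitting.
- Data preparation
  - Construct the FamilySize and IsAlone features
  - Fill the missing values in Embarked (port)
  - Fill the missing values in Fare
  - Fill the missing values in Age with values from the interval mean ± standard deviation
  - Construct a Title feature from Name
  - Bin Sex, Title, Embarked, Fare, and Age
- Stacking
  - Run 5-fold cross-validation for each of RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, and SVC, keeping the predictions for the training set and the test set
  - Train an XGBClassifier whose inputs are those predictions and whose target is the true label, and use it to produce the final predictions (the submission file)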
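The stacking procedure described above can be sketched as follows. The data is synthetic, only three of the listed base models are used, and LogisticRegression stands in for the XGBClassifier meta-model to avoid an extra dependency; the helper name out_of_fold is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 200 training rows, 100 test rows.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]


def out_of_fold(model, X_tr, y_tr, X_te, n_splits=5):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    oof = np.zeros(len(X_tr))
    test_pred = np.zeros(len(X_te))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr_idx, va_idx in kf.split(X_tr):
        model.fit(X_tr[tr_idx], y_tr[tr_idx])
        # Each training row is predicted by a model that did not see it.
        oof[va_idx] = model.predict_proba(X_tr[va_idx])[:, 1]
        # The test prediction is the average over the fold models.
        test_pred += model.predict_proba(X_te)[:, 1] / n_splits
    return oof, test_pred


base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
]
pairs = [out_of_fold(m, X_train, y_train, X_test) for m in base_models]

# The base-model predictions become the features of the second-level model.
meta_train = np.column_stack([oof for oof, _ in pairs])
meta_test = np.column_stack([test_pred for _, test_pred in pairs])

meta_model = LogisticRegression().fit(meta_train, y_train)
final_pred = meta_model.predict(meta_test)
print(meta_train.shape, meta_test.shape, final_pred[:10])
```

Using out-of-fold predictions as meta-features is what keeps the second-level model from simply learning the base models' training-set overfit; a flexible meta-model on so few rows can still overfit, which matches the caution above.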

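0.80861 Voting/ensembling by Nick Brooks. An impressive number of models is packed in almost one hour of running time!

This notebook uses a stacking model.
- Data loading, preprocessing, and feature engineering
  - Construct FamilySize, Name_length, IsAlone, and Title
  - Fill the missing values in Age based on the Title from the name
  - Fill the missing values in Embarked (port) with its mode
  - Fill the missing values in Fare with its mean
  - Label-encode Sex; label-encode Title and fill its missing values with its mode
  - Label-encode Embarked
  - Drop the irrelevant features 'Ticket' and 'Cabin'
  - Scale the continuous variables to the range -1 to 1
  - Split the dataset into a training set and a test set
- Model training
  - K-Nearest Neighbors
  - SGDClassifier
  - Decision Trees
  - Random Forest
  - AdaBoostClassifier
  - GradientBoostingClassifier
  - XGBClassifier
  - CatBoost
  - LightGBM (lgb)
  - LogisticRegression
  - MLPClassifier
  - SVC
- Stacking of the results of the models above
  - Logistic Regression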
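A compact way to reproduce the overall shape of this approach is scikit-learn's StackingClassifier, sketched below on synthetic data. Only three base models are included, and reading the final "Logistic Regression" item as the second-level model is an interpretation of the summary, not something it states explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split into training and validation parts.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

stack = make_pipeline(
    # Scale every feature to the range [-1, 1], as in the preprocessing step.
    MinMaxScaler(feature_range=(-1, 1)),
    StackingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier()),
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        # Second-level model trained on cross-validated base predictions.
        final_estimator=LogisticRegression(),
        cv=5,
    ),
)
stack.fit(X_tr, y_tr)
val_acc = stack.score(X_va, y_va)
print(f"validation accuracy: {val_acc:.4f}")
```

StackingClassifier builds the out-of-fold predictions internally (controlled by cv), so no manual fold loop is needed.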

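0.80861 Kaggle Titanic with Tensorflow by nme-42. It is quite an interesting kernel.

This notebook uses a DNN.
- Data loading
- Data preprocessing
  - Construct a deck-level feature from the cabin
  - Drop the Cabin feature
  - Fill the missing values in Embarked with 'N' to mark them as missing
  - Fill the missing values in Fare with the mode of Fare for the corresponding Pclass
  - Fill the missing values in Age
- Feature engineering
  - Construct the Family Size feature
The remaining part uses techniques I have not seen much before; see the notebook if you are interested.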
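The preprocessing steps listed for this notebook could be sketched as below. The rows are invented; taking the first letter of Cabin as the deck and marking a missing deck with 'N' are assumptions made for the illustration (the summary only specifies 'N' for missing Embarked values).

```python
import numpy as np
import pandas as pd

# Invented rows (assumption) with the columns touched by the preprocessing.
df = pd.DataFrame({
    "Cabin": ["C85", None, "E46", None],
    "Embarked": ["S", None, "C", "S"],
    "Pclass": [1, 3, 1, 3],
    "Fare": [71.0, 8.05, 71.0, np.nan],
})

# Deck level: first letter of the cabin; 'N' marks an unknown cabin (assumption).
df["Deck"] = df["Cabin"].str[0].fillna("N")
df = df.drop(columns="Cabin")

# Missing ports are kept as their own category 'N'.
df["Embarked"] = df["Embarked"].fillna("N")

# Most frequent Fare per ticket class, used to fill missing fares.
fare_mode_by_class = df.groupby("Pclass")["Fare"].agg(lambda s: s.mode().iloc[0])
df["Fare"] = df["Fare"].fillna(df["Pclass"].map(fare_mode_by_class))
print(df)
```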

0.81339 Titanic Using Ticket Grouping by Jack Roberts.

This notebook predicts with hand-written rules.

0.82775 Frank Sylla engineers several features.

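- Data loading
- Feature engineering
  - Build the surname and Title features from Name, and label-encode Title (TitleCat)
  - Build the family size feature (FamilySize), split it into groups, and label-encode it
  - Build the name length feature (NameLength)
  - Fill the missing values in Fare
  - Dummy-encode Sex
  - Label-encode Embarked and the first character of Cabin
  - Build the CabinType feature from the numeric part of Cabin, indicating whether the cabin number is odd, even, or empty
  - Build the person feature, which distinguishes CHILD / FEMALE ADULT / MALE ADULT, dummy-encode it, and concatenate it with the original features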
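Three of the engineered features above (NameLength, CabinType, and person) can be sketched in pandas as follows. The rows are invented, and the age below which a passenger counts as a child (CHILD_AGE) is a hypothetical value, since the summary does not give the notebook's threshold.

```python
import numpy as np
import pandas as pd

# Invented rows (assumption) with the columns needed for these features.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Ann", "Poe, Mrs. Eve", "Low, Mr. Max"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [30.0, 8.0, 25.0, 40.0],
    "Cabin": ["C85", "B28", None, "F"],
})

# NameLength: number of characters in the full name.
df["NameLength"] = df["Name"].str.len()

# CabinType: parity of the numeric part of Cabin, or "none" if there is no number.
cabin_number = pd.to_numeric(
    df["Cabin"].str.extract(r"(\d+)", expand=False), errors="coerce"
)
df["CabinType"] = np.select(
    [cabin_number.isna(), cabin_number % 2 == 0],
    ["none", "even"],
    default="odd",
)

CHILD_AGE = 16  # hypothetical threshold


def person(row):
    """Classify a passenger as CHILD, FEMALE ADULT, or MALE ADULT."""
    if row["Age"] < CHILD_AGE:
        return "CHILD"
    return "FEMALE ADULT" if row["Sex"] == "female" else "MALE ADULT"


df["Person"] = df.apply(person, axis=1)

# Dummy-encode the person feature and append it to the original columns.
person_dummies = pd.get_dummies(df["Person"], prefix="Person", dtype=int)
df = pd.concat([df, person_dummies], axis=1)
print(df[["NameLength", "CabinType", "Person"]])
```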

0.83253 Konstantin brings attention to feature scaling, which is essential when working with the kNN algorithm.

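This notebook uses the KNN algorithm and achieves very good results.
- Feature engineering
  - Build the Title feature from Name, and estimate the missing values in Age from Title
  - Build the Family_Size feature
  - Build the surname feature Last_Name
  - Fill the missing values in the ticket price feature Fare with the mean
  - Build the Family_Survival feature from Last_Name, Fare, Survived, and Ticket
  - Fill the missing values in Fare with the median of Fare, then apply quantile binning followed by label encoding
  - Apply quantile binning to Age, followed by label encoding
  - Label-encode the gender feature
  - Scale the features to the range -1 to 1
- Model training
  - Tune the KNN hyperparameters with Grid Search CV
  - Predict with the KNN model using the best parameters found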
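The binning, scaling, and tuning steps can be sketched as follows. The data is randomly generated, the number of bins and the parameter grid are assumptions, and pd.qcut with labels=False is used so that it returns integer bin codes directly, which plays the role of the separate label-encoding step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Randomly generated stand-in data (assumption).
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "Fare": rng.gamma(2.0, 15.0, size=n),
    "Age": rng.uniform(1, 70, size=n),
    "Sex": rng.integers(0, 2, size=n),
})
df.loc[5, "Fare"] = np.nan
# Toy label: mostly determined by Sex, with 20% of the labels flipped.
y = (df["Sex"].to_numpy() ^ (rng.random(n) < 0.2)).astype(int)

# Fill missing Fare with the median, then cut Fare and Age into quantile bins.
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
df["FareBin"] = pd.qcut(df["Fare"], 5, labels=False)
df["AgeBin"] = pd.qcut(df["Age"], 4, labels=False)
features = df[["FareBin", "AgeBin", "Sex"]]

# kNN is distance-based, so all features are scaled to the same range first.
pipeline = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), KNeighborsClassifier())
search = GridSearchCV(
    pipeline,
    param_grid={
        "kneighborsclassifier__n_neighbors": [3, 5, 7],
        "kneighborsclassifier__weights": ["uniform", "distance"],
    },
    cv=5,
)
search.fit(features, y)
print(search.best_params_, round(search.best_score_, 4))
```

Placing the scaler inside the pipeline means it is refit on each training fold, so the cross-validated score is not influenced by the validation rows.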

References:

- 《kaggle竞赛攻顶秘笈》
- Titanic - Machine Learning from Disaster
- How am I doing with my score?
