Machine Learning Project in Practice: Titanic Survival Prediction (Part 1)

Time: 2023-10-12 09:10:29

1. Task Background

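The sinking of the Titanic is one of the most famous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups were more likely to survive than others, such as women, children, and the upper class. In this case study we use machine learning to predict which passengers survived the tragedy.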

Dataset link: /s/1bVnIM5JVZjib1znZIDn10g . Extraction code: 1htm .

Read the titanic_train dataset

import pandas

# Read the dataset
titanic = pandas.read_csv('titanic_train.csv')
titanic.head(10)

View the first 10 rows of the dataset

Explanation of the feature names

2. Data Preprocessing

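We can see that the Age column has missing values (NaN). There are generally two ways to handle missing data: fill in the missing values, or drop the feature altogether. Age is likely to have a sizeable effect on the outcome, so we keep the column and fill its missing values, here with the median.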

# Fill missing Age values with the column median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
print(titanic.describe())

View the dataset description after filling
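Before choosing a fill strategy, it can help to count the missing values in each column and to compare the mean with the median. The following is a minimal sketch on a tiny made-up frame; the name `toy` and its values are illustrative assumptions, not rows from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 80.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 30.00],
})

# Count missing values in each column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# The median is less sensitive to outliers (such as the 80.0 here) than the mean.
age_mean = toy["Age"].mean()
age_median = toy["Age"].median()
print(age_mean, age_median)

# Fill the gaps with the median.
filled_age = toy["Age"].fillna(age_median)
print(filled_age)
```

On this toy column the mean is pulled upward by the single large value while the median is not, which is one common reason to prefer the median for skewed features.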

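Machine learning algorithms generally cannot work directly on string-valued features. Our goal is to classify the Survived column into "0" and "1", so the "Sex" column has to be mapped to numbers first: "male" becomes 0 and "female" becomes 1.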

# print(titanic['Sex'].unique())

# Replace all occurrences of "male" with the number 0
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
# Replace all occurrences of "female" with the number 1
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

We process the "Embarked" column in the same way.

# print(titanic['Embarked'].unique())

# Embarked is the port of embarkation. It takes three text values (C/S/Q), which are
# awkward to analyze, so we map them to numbers. The column also has 2 missing values.
titanic['Embarked'] = titanic['Embarked'].fillna('S')  # fill gaps with the column's mode, 'S'
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
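The same mappings can also be written with `Series.map` and a dictionary, which keeps all the codes in one place and yields an integer column directly. This is a sketch of an alternative, not the article's method; the frame `toy` and the dictionary names are hypothetical, and the rows are made up.

```python
import pandas as pd

# Toy frame with made-up rows, including one missing Embarked value.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

# Same codes as in the article.
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

# map() looks up every value in the dictionary and returns a new Series.
toy["Sex"] = toy["Sex"].map(sex_codes)
toy["Embarked"] = toy["Embarked"].fillna("S").map(embarked_codes)

print(toy.dtypes)
print(toy)
```

One thing to watch with `map`: any value missing from the dictionary becomes NaN, so the gaps should be filled before mapping, as done here.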

3. Classification

First, we use linear regression for the classification task.

# Import the linear regression class (take care to import from the right modules)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# The columns we'll use to predict the target
predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

# Initialize our algorithm class
alg = LinearRegression()

# Generate cross-validation folds for the Titanic dataset. kf.split returns the row
# indices corresponding to the train and test folds.
# The old call KFold(titanic.shape[0], n_folds=3, random_state=1) is no longer supported.
# The samples are split into 3 equal parts for 3-fold cross-validation. With shuffle=False
# the splits are already the same on every run, and recent scikit-learn versions raise
# an error if random_state is passed as well, so it is omitted.
kf = KFold(n_splits=3, shuffle=False)

# Note: pass the DataFrame itself, not kf.split(titanic.shape[0]), which fails with:
# Singleton array array(891) cannot be considered a valid collection.
predictions = []

# Split into training and validation folds
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm.
    # Note how we only take the rows in the train folds.
    train_predictors = titanic[predictors].iloc[train, :]
    # The target we're using to train the algorithm
    train_target = titanic['Survived'].iloc[train]
    # Train the algorithm using the predictors and target
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
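To see what `kf.split` actually yields, here is a small sketch on six dummy samples. The array `data` and the list `folds` are illustrative names, not part of the article's code.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six dummy samples; only the number of rows matters for the split.
data = np.arange(6).reshape(6, 1)

kf = KFold(n_splits=3, shuffle=False)

# Each iteration yields two arrays of row positions: training rows and test rows.
folds = [(train_idx.tolist(), test_idx.tolist()) for train_idx, test_idx in kf.split(data)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

Because the folds are not shuffled, the test folds are consecutive blocks of rows. Concatenating the per-fold predictions therefore restores the original row order, which is what makes a row-by-row comparison with the Survived column valid.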

Check the accuracy of linear regression

import numpy as np

# The predictions are in three separate numpy arrays. Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (the only possible outcomes are 1 and 0)
# so that we can compute the accuracy
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

# Note: this line differs from the original source code
accuracy = sum(predictions == titanic['Survived']) / len(predictions)

# Accuracy on the validation folds
print(accuracy)

The resulting accuracy is

0.7833894500561167
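The thresholding and accuracy steps can be checked by hand on a few numbers. The sketch below uses made-up regression outputs and labels, and compares the manual calculation with scikit-learn's `accuracy_score`; all variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up regression outputs and true labels, for illustration only.
raw_predictions = np.array([0.2, 0.7, 0.5, 0.9])
labels = np.array([0, 1, 1, 1])

# Threshold at 0.5: values above 0.5 become 1, everything else becomes 0.
predicted_labels = (raw_predictions > 0.5).astype(int)

# Accuracy is the fraction of rows where the predicted label matches the true one.
manual_accuracy = np.mean(predicted_labels == labels)
library_accuracy = accuracy_score(labels, predicted_labels)
print(manual_accuracy, library_accuracy)  # 0.75 0.75
```

The third sample sits exactly on the threshold and is mapped to 0, matching the `<= .5` rule used in the article.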

For a binary classification problem this accuracy does not look very good, so next we try logistic regression.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

alg = LogisticRegression(random_state=1)

# Compute the accuracy score for all the cross-validation folds
# (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

The resulting accuracy is shown below; it is slightly better.

0.8047138047138048
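One detail worth knowing when comparing the two numbers: when `cv` is an integer and the estimator is a classifier with a binary or multiclass target, `cross_val_score` uses `StratifiedKFold`, so its folds are not the same as the plain `KFold` splits used in the linear regression loop. The sketch below illustrates this on synthetic data from `make_classification` (not the Titanic set); `max_iter=1000` is an assumption added only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic binary-classification data standing in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

model = LogisticRegression(max_iter=1000)

# cv=3 and an explicit StratifiedKFold(3) produce the same folds for a classifier.
scores_int = cross_val_score(model, X, y, cv=3)
scores_stratified = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=3))

# A plain KFold splits by row position only and may give different scores.
scores_plain = cross_val_score(model, X, y, cv=KFold(n_splits=3))

print(scores_int, scores_stratified, scores_plain)
```

To compare models on identical folds, one option is to pass the same splitter object as `cv` to every `cross_val_score` call.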

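All of the results above come from classifying the validation folds produced by cross-validation. In practice, the final classification should be done on the test dataset.

We read the test dataset, fill its missing values, and map the text columns to numbers, in much the same way as above.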

titanic_test = pandas.read_csv("test.csv")

# Fill missing ages with the median age of the training set
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
# Fill missing fares with the median fare of the test set
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())

titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2
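Since the training and test sets go through the same steps, the preprocessing can be collected into one function. The following is only a sketch: `preprocess` is a hypothetical helper, the rows are made up, and unlike the code above it fills Fare with the training median rather than the test median, so that every fill value comes from the training data.

```python
import numpy as np
import pandas as pd

def preprocess(df, age_median, fare_median):
    """Return a copy of df with gaps filled and text columns mapped to numbers."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

# Made-up rows standing in for the training and test files.
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan], "Fare": [7.25, 71.28, 8.05],
    "Sex": ["male", "female", "male"], "Embarked": ["S", "C", None],
})
test = pd.DataFrame({
    "Age": [np.nan, 47.0], "Fare": [np.nan, 7.0],
    "Sex": ["female", "male"], "Embarked": ["Q", "S"],
})

# Statistics are computed once, on the training data, and reused for both frames.
age_median = train["Age"].median()
fare_median = train["Fare"].median()

train_clean = preprocess(train, age_median, fare_median)
test_clean = preprocess(test, age_median, fare_median)
print(test_clean)
```

Passing the fill values in as arguments makes it explicit which dataset they were computed from, and the function returns a copy so the raw frames stay unchanged.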

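From the results above, linear models such as linear regression and logistic regression do not seem to do very well here, so next we try a random forest (which often performs somewhat better than linear and logistic regression on this kind of data). Pay attention to how the random forest parameters change.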

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with explicit starting parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a
# tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)

# Compute the accuracy score for all the cross-validation folds.
# random_state is omitted from KFold: recent scikit-learn versions raise an error
# when it is combined with shuffle=False.
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

The accuracy is:

0.7856341189674523

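The accuracy is still not very good. Parameter tuning matters a lot in machine learning, and models are usually improved by adjusting their parameters. This time we adjust the random forest's parameters.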

alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)

# Compute the accuracy score for all the cross-validation folds.
# random_state is omitted from KFold: recent scikit-learn versions raise an error
# when it is combined with shuffle=False.
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

The resulting accuracy is:

0.8148148148148148
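Rather than changing the parameters by hand, the search can be automated with scikit-learn's `GridSearchCV`. The sketch below is an illustration on synthetic data from `make_classification`, standing in for `titanic[predictors]` and `titanic["Survived"]`; the grid values are arbitrary examples, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data standing in for the Titanic features and labels.
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

# A deliberately small, illustrative grid of candidate values.
param_grid = {
    "n_estimators": [10, 50],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=KFold(n_splits=3),
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The cost grows with the product of the grid sizes (here 2 x 2 x 2 combinations, each fitted 3 times), so larger grids are usually narrowed down in stages.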

To be continued...
