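Loading the TMDB dataset and preprocessing the data
The TMDb movie database. The dataset contains basic information on nearly 11,000 movies released from 1960 onward, mainly genre, budget, box-office revenue, cast and crew, runtime, and ratings. It is used here to practice data analysis.
Reference article: /moyue1002/article/details/80332186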
python 3.7
pandas 0.23
numpy 1.18
matplotlib 2.2
import pandas as pd
credits = pd.read_csv('./tmdb_5000_credits.csv')
movies = pd.read_csv('./tmdb_5000_movies.csv')
Inspect the general information of each DataFrame
# Information about the movies table
movies.head(1)
Out[3]:
   budget     genres                                             homepage  id     ...  tagline                      title   vote_average  vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  /         19995  ...  Enter the World of Pandora.  Avatar  7.2           11800
Information about the credits table
print(credits.info())
credits.head(1)
Out[4]:
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
movie_id    4803 non-null int64
title       4803 non-null object
cast        4803 non-null object
crew        4803 non-null object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
None
   movie_id  ...  crew
0  19995     ...  [{"credit_id": "52fe48009251416c750aca23", "de...
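The cast column of the credits table looks odd: every cell holds a large amount of data.
Let's take a closer look.
# Look at the value at index 0 of the cast column in the credits table; it turns out to be one long string
print('cast type:', type(credits['cast'][0]))  # check its type: it is `str`, which cannot be processed directly
Out[5]:
cast type: <class 'str'>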
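Handling JSON-formatted data. The table shows that the cast column actually holds JSON-formatted data, so it should be processed with the json package.
The JSON has the form [{}, {}], that is, a list of objects.
To convert a JSON string into a Python object, use json.loads().
json.load() works on files instead: it reads JSON from a file object.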
import json
type(json.loads(credits['cast'][0]))
Out[6]:
list
The output above shows that json.loads() turned the JSON string into a list, and that the list in turn wraps several dicts.
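To make the difference between json.loads() and json.load() concrete, here is a minimal sketch. The sample string and its values are made up for illustration (they are not taken from the dataset), and io.StringIO merely stands in for an opened file.

```python
import io
import json

# A tiny JSON string in the same [{...}, {...}] shape as the cast column.
# The values are hypothetical sample data.
raw = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'

# json.loads() parses a JSON *string* into Python objects.
from_string = json.loads(raw)

# json.load() parses from a *file-like object*; StringIO plays the role of a file here.
from_file = json.load(io.StringIO(raw))

print(type(from_string).__name__)     # list
print(type(from_string[0]).__name__)  # dict
print(from_string == from_file)       # True
```

Both calls produce the same list of dicts; the only difference is where the JSON text comes from.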
Next, process the columns in bulk.
import json
json_col = ['cast', 'crew']
for i in json_col:
    credits[i] = credits[i].apply(json.loads)  # parse every cell from a JSON string into a list of dicts
>>> credits['cast'][0][:3]
Out[7]:
[{'cast_id': 242,
  'character': 'Jake Sully',
  'credit_id': '5602a8a7c3a3685532001c9a',
  'gender': 2,
  'id': 65731,
  'name': 'Sam Worthington',
  'order': 0},
 {'cast_id': 3,
  'character': 'Neytiri',
  'credit_id': '52fe48009251416c750ac9cb',
  'gender': 1,
  'id': 8691,
  'name': 'Zoe Saldana',
  'order': 1},
 {'cast_id': 25,
  'character': 'Dr. Grace Augustine',
  'credit_id': '52fe48009251416c750aca39',
  'gender': 1,
  'id': 10205,
  'name': 'Sigourney Weaver',
  'order': 2}]
print('cast type after parsing:', type(credits['cast'][0]))
# The data type is now list, so it can be processed in a loop
Out[8]:
cast type after parsing: <class 'list'>
Extract the names
credits['cast'][0][:3]
# The cast of the first row of credits, which is a list
Out[9]:
[{'cast_id': 242,
  'character': 'Jake Sully',
  'credit_id': '5602a8a7c3a3685532001c9a',
  'gender': 2,
  'id': 65731,
  'name': 'Sam Worthington',
  'order': 0},
 {'cast_id': 3,
  'character': 'Neytiri',
  'credit_id': '52fe48009251416c750ac9cb',
  'gender': 1,
  'id': 8691,
  'name': 'Zoe Saldana',
  'order': 1},
 {'cast_id': 25,
  'character': 'Dr. Grace Augustine',
  'credit_id': '52fe48009251416c750aca39',
  'gender': 1,
  'id': 10205,
  'name': 'Sigourney Weaver',
  'order': 2}]
credits['cast'][0][0]['name']  # get the person's name from the first dict of the first row
Out[10]:
'Sam Worthington'
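Commonly used dict methods: dict.get() returns the value for the given key, or a default value if the key is not in the dictionary.
dict.items() returns an iterable view of (key, value) tuples.
# Test code: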
i = credits['cast'][0][0]
for x in i.items():
    print(x)
Out[11]:
('cast_id', 242)
('character', 'Jake Sully')
('credit_id', '5602a8a7c3a3685532001c9a')
('gender', 2)
('id', 65731)
('name', 'Sam Worthington')
('order', 0)
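The test above only exercises dict.items(). For completeness, dict.get() can be sketched the same way. The dict below is an abridged copy of the first cast entry shown earlier; the 'job' key and the 'unknown' default are assumptions chosen purely to illustrate a missing key.

```python
# Abridged copy of the first cast entry from the output above.
member = {'cast_id': 242, 'character': 'Jake Sully', 'gender': 2,
          'id': 65731, 'name': 'Sam Worthington', 'order': 0}

name = member.get('name')                       # key exists: its value is returned
job = member.get('job')                         # key missing, no default given: None
job_or_default = member.get('job', 'unknown')   # key missing: the supplied default is returned

print(name)            # Sam Worthington
print(job)             # None
print(job_or_default)  # unknown
```

Unlike member['job'], which would raise a KeyError, get() never fails on a missing key, which is handy when some records lack a field.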
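Create a get_names() function to pull the names out of cast
def get_names(x):
    # x is a list of dicts; join the 'name' field of every dict with commas
    return ','.join(i['name'] for i in x)
credits['cast'] = credits['cast'].apply(get_names)
credits['cast'][:3]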
Out[12]:
0    Sam Worthington,Zoe Saldana,Sigourney Weaver,S...
1    Johnny Depp,Orlando Bloom,Keira Knightley,Stel...
2    Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...
Name: cast, dtype: object
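To see what get_names() does to a single cell without loading the CSV, here is a small standalone sketch. The two-entry cast list is a shortened, hand-written sample modeled on the output above, and the parameter names are my own.

```python
def get_names(entries):
    """Join the 'name' field of each dict in a list into one comma-separated string."""
    return ','.join(entry['name'] for entry in entries)

# Hand-written sample in the shape of one parsed cast cell.
cast = [{'name': 'Sam Worthington', 'order': 0},
        {'name': 'Zoe Saldana', 'order': 1}]

joined = get_names(cast)
print(joined)                # Sam Worthington,Zoe Saldana
print(joined.split(','))     # ['Sam Worthington', 'Zoe Saldana']
print(repr(get_names([])))   # ''
```

An empty list yields an empty string rather than an error, and split(',') recovers the individual names when a later analysis step needs them one by one.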
Extract the director from crew
credits['crew'][0][0]
Out[13]:
{'credit_id': '52fe48009251416c750aca23',
 'department': 'Editing',
 'gender': 0,
 'id': 1721,
 'job': 'Editor',
 'name': 'Stephen E. Rivkin'}
# Loop over the crew list, find the entry whose job is 'Director', then read and return the name
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    # falls through and returns None when no director is listed
credits['crew'] = credits['crew'].apply(director)
print(credits[['crew']][:3])
credits.rename(columns={'crew': 'director'}, inplace=True)  # rename the column
credits[['director']][:3]
Out[14]:
             crew
0   James Cameron
1  Gore Verbinski
2      Sam Mendes
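The same lookup can also be written with next() and a generator expression, which makes the "no director found" case explicit. This is only an alternative sketch, not the approach used above: find_director is a hypothetical name, and the second crew entry is an assumed record modeled on the shape of the first.

```python
def find_director(crew):
    """Return the name of the first crew member whose job is 'Director', or None."""
    return next(
        (member['name'] for member in crew if member.get('job') == 'Director'),
        None,  # default returned when no entry matches
    )

crew = [
    {'department': 'Editing', 'job': 'Editor', 'name': 'Stephen E. Rivkin'},
    # Assumed entry, written by hand in the same shape as the one above.
    {'department': 'Directing', 'job': 'Director', 'name': 'James Cameron'},
]

print(find_director(crew))      # James Cameron
print(find_director(crew[:1]))  # None
```

Either version leaves None in the column for movies without a listed director, so it is worth checking for missing values before analyzing that column.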
Parse the JSON columns of the movies table
>>> movies.head(1)
Out[15]:
   budget     genres                                             homepage  id     ...  tagline                      title   vote_average  vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  /         19995  ...  Enter the World of Pandora.  Avatar  7.2           11800
The genres, keywords, spoken_languages, production_countries, and production_companies columns all need JSON parsing.
# Same approach as for the credits table
json_col = ['genres', 'keywords', 'spoken_languages', 'production_countries', 'production_companies']
for i in json_col:
    movies[i] = movies[i].apply(json.loads)  # JSON string -> list of dicts
    movies[i] = movies[i].apply(get_names)   # list of dicts -> comma-separated names
>>> movies.head(1)
Out[16]:
   budget     genres                                    homepage  id     ...  tagline                      title   vote_average  vote_count
0  237000000  Action,Adventure,Fantasy,Science Fiction  /         19995  ...  Enter the World of Pandora.  Avatar  7.2           11800
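One thing to watch with the in-place loop: running it a second time fails, because the columns then hold plain name strings and json.loads() can no longer parse them. A possible way around this, sketched below, is to wrap both steps in a helper that works on a copy. flatten_json_columns and its key parameter are hypothetical names of my own, and the one-row DataFrame is a hand-made sample modeled on the output above rather than the real file.

```python
import json

import pandas as pd


def flatten_json_columns(df, columns, key='name'):
    """Return a copy of df where each JSON-string column becomes a comma-separated string of `key` values."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        out[col] = out[col].apply(
            lambda text: ','.join(item[key] for item in json.loads(text))
        )
    return out


# Hand-made one-row sample in the shape of the movies table.
raw = pd.DataFrame({
    'title': ['Avatar'],
    'genres': ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'],
})

flat = flatten_json_columns(raw, ['genres'])
print(flat.loc[0, 'genres'])      # Action,Adventure
print(raw.loc[0, 'genres'][:9])   # [{"id": 2  (the original column is unchanged)
```

Because the raw DataFrame keeps its JSON strings, the helper can be called again at any time without re-reading the CSV.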
With that, the data preprocessing is done.
This article was shared from CSDN, by 松鼠爱吃饼干.