Loading the TMDB dataset and preprocessing the data
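The TMDb movie database: the dataset is described as containing basic information on nearly 11,000 movies released between 1960 and 2016, mainly genre, budget, box-office revenue, cast and crew, runtime, and rating. It is used here to practice data analysis. Note that the credits file loaded below reports 4,803 rows, so the tmdb_5000 files used in this post appear to be a smaller extract than the one described.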
Reference article: https://blog.csdn.net/moyue1002/article/details/80332186
python 3.7
pandas 0.23
numpy 1.18
matplotlib 2.2
import pandas as pd

credits = pd.read_csv('./tmdb_5000_credits.csv')
movies = pd.read_csv('./tmdb_5000_movies.csv')
Inspecting the general information of each DataFrame
# Information about the movies table
movies.head(1)
Out[3]:
      budget                                             genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800
Information about the credits table:
print(credits.info())
credits.head(1)
Out[4]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
movie_id    4803 non-null int64
title       4803 non-null object
cast        4803 non-null object
crew        4803 non-null object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
None
   movie_id  ...                                               crew
0     19995  ...  [{"credit_id": "52fe48009251416c750aca23", "de...
The cast column of the credits table looks unusual: every cell holds a large amount of data.
Let's take a closer look.
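# Look at the value at index 0 of the cast column in the credits table; it turns out to be one long string
print('cast type:', type(credits['cast'][0]))  # the type is `str`, which cannot be processed directly
Out[5]:
cast type: <class 'str'>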
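Processing the JSON-formatted data
The table shows that the cast column actually holds JSON-formatted data, so it should be processed with the json package.
The JSON has the form [{},{}], that is, an array of objects.
json.loads() converts a JSON-formatted string into a Python object.
json.load() works on files: it reads JSON from a file object.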
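The difference between the two functions can be sketched with a short, self-contained example. The sample string is an abbreviated, illustrative stand-in for one cell of the cast column, and io.StringIO is used here only as a substitute for an open file.

```python
import io
import json

# Illustrative sample shaped like one cell of the cast column (abbreviated).
raw = '[{"name": "Sam Worthington", "order": 0}, {"name": "Zoe Saldana", "order": 1}]'

# json.loads parses a str that is already in memory.
from_string = json.loads(raw)

# json.load reads from a file-like object; io.StringIO stands in for an open file.
from_file = json.load(io.StringIO(raw))

print(type(from_string))         # a list of dicts
print(from_string == from_file)  # both calls produce the same Python object
```

Because each cell of the column is a string rather than a file, json.loads() is the one that applies to this dataset.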
import json
type(json.loads(credits['cast'][0]))
Out[6]:
list
As shown above, json.loads() converted the JSON string into a list, and that list in turn contains multiple dicts.
Next, process the columns in bulk.
import json

json_col = ['cast', 'crew']
for i in json_col:
    credits[i] = credits[i].apply(json.loads)

credits['cast'][0][:3]
Out[7]:
[{'cast_id': 242,
  'character': 'Jake Sully',
  'credit_id': '5602a8a7c3a3685532001c9a',
  'gender': 2,
  'id': 65731,
  'name': 'Sam Worthington',
  'order': 0},
 {'cast_id': 3,
  'character': 'Neytiri',
  'credit_id': '52fe48009251416c750ac9cb',
  'gender': 1,
  'id': 8691,
  'name': 'Zoe Saldana',
  'order': 1},
 {'cast_id': 25,
  'character': 'Dr. Grace Augustine',
  'credit_id': '52fe48009251416c750aca39',
  'gender': 1,
  'id': 10205,
  'name': 'Sigourney Weaver',
  'order': 2}]

print('cast type is now:', type(credits['cast'][0]))  # the data type is now list, so it can be looped over
Out[8]:
cast type is now: <class 'list'>
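One practical caveat when working in a notebook: json.loads() raises a TypeError if it receives a list instead of a string, so running the conversion loop a second time on already-parsed columns fails. A minimal sketch of a guard, assuming a hypothetical helper named loads_if_str (not part of the original post):

```python
import json

def loads_if_str(value):
    """Parse value with json.loads only while it is still a str (hypothetical helper)."""
    return json.loads(value) if isinstance(value, str) else value

raw = '[{"name": "Sam Worthington"}]'  # illustrative sample cell
once = loads_if_str(raw)    # str -> list of dicts
twice = loads_if_str(once)  # already a list, returned unchanged
print(once == twice)
```

Passing loads_if_str to apply() in place of json.loads would make the cell safe to re-run.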
Extracting the names
credits['cast'][0][:3]  # the cast of the first row of credits, which is a list
Out[9]:
[{'cast_id': 242,
  'character': 'Jake Sully',
  'credit_id': '5602a8a7c3a3685532001c9a',
  'gender': 2,
  'id': 65731,
  'name': 'Sam Worthington',
  'order': 0},
 {'cast_id': 3,
  'character': 'Neytiri',
  'credit_id': '52fe48009251416c750ac9cb',
  'gender': 1,
  'id': 8691,
  'name': 'Zoe Saldana',
  'order': 1},
 {'cast_id': 25,
  'character': 'Dr. Grace Augustine',
  'credit_id': '52fe48009251416c750aca39',
  'gender': 1,
  'id': 10205,
  'name': 'Sigourney Weaver',
  'order': 2}]

credits['cast'][0][0]['name']  # the name in the first dict of the first row
Out[10]:
'Sam Worthington'
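Commonly used dict methods
dict.get() returns the value for the given key; if the key is not in the dictionary, it returns the default value (None unless another default is passed).
dict.items() returns an iterable view of the (key, value) tuples.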
# Test code:
i = credits['cast'][0][0]
for x in i.items():
    print(x)
Out[11]:
('cast_id', 242)
('character', 'Jake Sully')
('credit_id', '5602a8a7c3a3685532001c9a')
('gender', 2)
('id', 65731)
('name', 'Sam Worthington')
('order', 0)
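The test above covers dict.items(); dict.get() can be sketched the same way. The dictionary below is an abbreviated copy of the first cast entry, and the 'job' key is deliberately absent to show the fallback behaviour.

```python
# Abbreviated copy of the first cast entry shown above.
member = {'cast_id': 242, 'character': 'Jake Sully', 'name': 'Sam Worthington', 'order': 0}

name = member.get('name')                   # key present: returns its value
job = member.get('job')                     # key absent, no default given: returns None
job_or_default = member.get('job', 'n/a')   # key absent: returns the supplied default

pairs = list(member.items())                # materialise the view as a list of tuples
print(name, job, job_or_default)
print(pairs[0])
```

Unlike member['job'], which would raise a KeyError, member.get('job') never raises for a missing key.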
Create a get_names() function to further break down cast
def get_names(x):
    return ','.join(i['name'] for i in x)

credits['cast'] = credits['cast'].apply(get_names)
credits['cast'][:3]
Out[12]:
0    Sam Worthington,Zoe Saldana,Sigourney Weaver,S...
1    Johnny Depp,Orlando Bloom,Keira Knightley,Stel...
2    Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...
Name: cast, dtype: object
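get_names() assumes every dict has a 'name' key and would raise a KeyError otherwise. Whether such entries occur in this dataset is not shown in the post, so the following is only a defensive sketch, with get_names_safe as a hypothetical variant and a made-up malformed entry.

```python
def get_names_safe(entries):
    """Join the 'name' values of a list of dicts, skipping entries without one."""
    return ','.join(e['name'] for e in entries if e.get('name'))

cast = [
    {'name': 'Sam Worthington', 'order': 0},
    {'name': 'Zoe Saldana', 'order': 1},
    {'order': 2},  # hypothetical malformed entry with no 'name' key
]
result = get_names_safe(cast)
print(result)
print(repr(get_names_safe([])))  # an empty list gives an empty string
```

Joining with a comma also means a name that itself contains a comma could not be split back out reliably later; a different separator may be preferable if that matters.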
Extracting the director from crew
credits['crew'][0][0]
Out[13]:
{'credit_id': '52fe48009251416c750aca23',
 'department': 'Editing',
 'gender': 0,
 'id': 1721,
 'job': 'Editor',
 'name': 'Stephen E. Rivkin'}

# Loop over the crew list, find the entry whose job is 'Director', then return its name
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    # falls through and returns None when no Director is listed

credits['crew'] = credits['crew'].apply(director)
print(credits[['crew']][:3])
credits.rename(columns={'crew': 'director'}, inplace=True)  # rename the column
credits[['director']][:3]
Out[14]:
             crew
0   James Cameron
1  Gore Verbinski
2      Sam Mendes
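The same lookup can be written more compactly with next() and an explicit default, which makes the "no director found" case visible in the code. This is a sketch with a hypothetical name, find_director, and an abbreviated sample crew list; it returns only the first director when several are listed, just as the loop above does.

```python
def find_director(crew):
    """Return the name of the first crew member whose job is 'Director', else None."""
    return next((m['name'] for m in crew if m.get('job') == 'Director'), None)

# Abbreviated, illustrative crew list.
crew = [
    {'job': 'Editor', 'name': 'Stephen E. Rivkin'},
    {'job': 'Director', 'name': 'James Cameron'},
]
print(find_director(crew))
print(find_director([]))  # no crew at all: None
```

Rows that come back as None can then be found with credits['director'].isnull() after the rename.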
Parsing the JSON in the movies table
movies.head(1)
Out[15]:
      budget                                             genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800
As shown, genres, keywords, spoken_languages, production_countries, and production_companies need JSON parsing.
# Same approach as for the credits table
json_col = ['genres', 'keywords', 'spoken_languages', 'production_countries', 'production_companies']
for i in json_col:
    movies[i] = movies[i].apply(json.loads)
    movies[i] = movies[i].apply(get_names)

movies.head(1)
Out[16]:
      budget                                    genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
0  237000000  Action,Adventure,Fantasy,Science Fiction  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800
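The two apply() calls in the loop can also be folded into one function that parses and joins in a single pass. The helper name json_names is hypothetical, and the sample string is an abbreviated, illustrative genres cell rather than the full value from the dataset.

```python
import json

def json_names(raw):
    """Parse a JSON array string and join the 'name' fields (hypothetical helper)."""
    return ','.join(item['name'] for item in json.loads(raw))

# Abbreviated, illustrative genres cell.
genres_raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(json_names(genres_raw))   # prints: Action,Adventure
print(repr(json_names('[]')))   # an empty JSON array gives an empty string
```

With this helper the loop body would reduce to a single line, movies[i] = movies[i].apply(json_names), at the cost of no longer keeping the intermediate list of dicts.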