数据组合: Merge 和 Join 方法¶
#显示设置
import pandas as pd
import numpy as np
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
数据连接的类型¶
pd.merge() 函数有三种分类 一对一, 多对一, 和 多对多 三种
一对一链接¶
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
df1
| employee | group | |
|---|---|---|
| 0 | Bob | Accounting |
| 1 | Jake | Engineering |
| 2 | Lisa | Engineering |
| 3 | Sue | HR |
df2
| employee | hire_date | |
|---|---|---|
| 0 | Lisa | 2004 |
| 1 | Bob | 2008 |
| 2 | Jake | 2012 |
| 3 | Sue | 2014 |
将2个数据框合成一个,使用 pd.merge() 函数:
df3 = pd.merge(df1, df2)
df3
| employee | group | hire_date | |
|---|---|---|---|
| 0 | Bob | Accounting | 2008 |
| 1 | Jake | Engineering | 2012 |
| 2 | Lisa | Engineering | 2004 |
| 3 | Sue | HR | 2014 |
多对一链接¶
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
df3
| employee | group | hire_date | |
|---|---|---|---|
| 0 | Bob | Accounting | 2008 |
| 1 | Jake | Engineering | 2012 |
| 2 | Lisa | Engineering | 2004 |
| 3 | Sue | HR | 2014 |
df4
| group | supervisor | |
|---|---|---|
| 0 | Accounting | Carly |
| 1 | Engineering | Guido |
| 2 | HR | Steve |
pd.merge(df3, df4)
| employee | group | hire_date | supervisor | |
|---|---|---|---|---|
| 0 | Bob | Accounting | 2008 | Carly |
| 1 | Jake | Engineering | 2012 | Guido |
| 2 | Lisa | Engineering | 2004 | Guido |
| 3 | Sue | HR | 2014 | Steve |
多对多链接¶
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
df1
| employee | group | |
|---|---|---|
| 0 | Bob | Accounting |
| 1 | Jake | Engineering |
| 2 | Lisa | Engineering |
| 3 | Sue | HR |
df5
| group | skills | |
|---|---|---|
| 0 | Accounting | math |
| 1 | Accounting | spreadsheets |
| 2 | Engineering | coding |
| 3 | Engineering | linux |
| 4 | HR | spreadsheets |
| 5 | HR | organization |
pd.merge(df1, df5)
| employee | group | skills | |
|---|---|---|---|
| 0 | Bob | Accounting | math |
| 1 | Bob | Accounting | spreadsheets |
| 2 | Jake | Engineering | coding |
| 3 | Jake | Engineering | linux |
| 4 | Lisa | Engineering | coding |
| 5 | Lisa | Engineering | linux |
| 6 | Sue | HR | spreadsheets |
| 7 | Sue | HR | organization |
设置数据合并的键¶
参数 on 用法¶
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")
df1
| employee | group | |
|---|---|---|
| 0 | Bob | Accounting |
| 1 | Jake | Engineering |
| 2 | Lisa | Engineering |
| 3 | Sue | HR |
df2
| employee | hire_date | |
|---|---|---|
| 0 | Lisa | 2004 |
| 1 | Bob | 2008 |
| 2 | Jake | 2012 |
| 3 | Sue | 2014 |
pd.merge(df1, df2, on='employee')
| employee | group | hire_date | |
|---|---|---|---|
| 0 | Bob | Accounting | 2008 |
| 1 | Jake | Engineering | 2012 |
| 2 | Lisa | Engineering | 2004 |
| 3 | Sue | HR | 2014 |
left_on 和 right_on 用法¶
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')
df1
| employee | group | |
|---|---|---|
| 0 | Bob | Accounting |
| 1 | Jake | Engineering |
| 2 | Lisa | Engineering |
| 3 | Sue | HR |
df3
| name | salary | |
|---|---|---|
| 0 | Bob | 70000 |
| 1 | Jake | 80000 |
| 2 | Lisa | 120000 |
| 3 | Sue | 90000 |
pd.merge(df1, df3, left_on="employee", right_on="name")
| employee | group | name | salary | |
|---|---|---|---|---|
| 0 | Bob | Accounting | Bob | 70000 |
| 1 | Jake | Engineering | Jake | 80000 |
| 2 | Lisa | Engineering | Lisa | 120000 |
| 3 | Sue | HR | Sue | 90000 |
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
| employee | group | salary | |
|---|---|---|---|
| 0 | Bob | Accounting | 70000 |
| 1 | Jake | Engineering | 80000 |
| 2 | Lisa | Engineering | 120000 |
| 3 | Sue | HR | 90000 |
left_index 和 right_index用法¶
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
df1a
| group | |
|---|---|
| employee | |
| Bob | Accounting |
| Jake | Engineering |
| Lisa | Engineering |
| Sue | HR |
df2a
| hire_date | |
|---|---|
| employee | |
| Lisa | 2004 |
| Bob | 2008 |
| Jake | 2012 |
| Sue | 2014 |
display('df1a', 'df2a',
"pd.merge(df1a, df2a, left_index=True, right_index=True)")
df1a
| group | |
|---|---|
| employee | |
| Bob | Accounting |
| Jake | Engineering |
| Lisa | Engineering |
| Sue | HR |
df2a
| hire_date | |
|---|---|
| employee | |
| Lisa | 2004 |
| Bob | 2008 |
| Jake | 2012 |
| Sue | 2014 |
pd.merge(df1a, df2a, left_index=True, right_index=True)
| group | hire_date | |
|---|---|---|
| employee | ||
| Lisa | Engineering | 2004 |
| Bob | Accounting | 2008 |
| Jake | Engineering | 2012 |
| Sue | HR | 2014 |
为了更方便,DataFrame有一个join() 方法, 可以执行merge类似的功能:
display('df1a', 'df2a', 'df1a.join(df2a)')
df1a
| group | |
|---|---|
| employee | |
| Bob | Accounting |
| Jake | Engineering |
| Lisa | Engineering |
| Sue | HR |
df2a
| hire_date | |
|---|---|
| employee | |
| Lisa | 2004 |
| Bob | 2008 |
| Jake | 2012 |
| Sue | 2014 |
df1a.join(df2a)
| group | hire_date | |
|---|---|---|
| employee | ||
| Bob | Accounting | 2008 |
| Jake | Engineering | 2012 |
| Lisa | Engineering | 2004 |
| Sue | HR | 2014 |
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
df1a
| group | |
|---|---|
| employee | |
| Bob | Accounting |
| Jake | Engineering |
| Lisa | Engineering |
| Sue | HR |
df3
| name | salary | |
|---|---|---|
| 0 | Bob | 70000 |
| 1 | Jake | 80000 |
| 2 | Lisa | 120000 |
| 3 | Sue | 90000 |
pd.merge(df1a, df3, left_index=True, right_on='name')
| group | name | salary | |
|---|---|---|---|
| 0 | Accounting | Bob | 70000 |
| 1 | Engineering | Jake | 80000 |
| 2 | Engineering | Lisa | 120000 |
| 3 | HR | Sue | 90000 |
数据连接的集合操作规则¶
当一个值出现在一列但不在另一列,需要集合规则:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
df6
| name | food | |
|---|---|---|
| 0 | Peter | fish |
| 1 | Paul | beans |
| 2 | Mary | bread |
df7
| name | drink | |
|---|---|---|
| 0 | Mary | wine |
| 1 | Joseph | beer |
pd.merge(df6, df7)
| name | food | drink | |
|---|---|---|---|
| 0 | Mary | bread | wine |
只有一个交集:Mary,这是因为合并时取两个数据的交集,这种方法被称为“内连接”。
pd.merge(df6, df7, how='inner')
| name | food | drink | |
|---|---|---|---|
| 0 | Mary | bread | wine |
其他的 how参数有 'outer', 'left', and 'right',示例如下:
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")
df6
| name | food | |
|---|---|---|
| 0 | Peter | fish |
| 1 | Paul | beans |
| 2 | Mary | bread |
df7
| name | drink | |
|---|---|---|
| 0 | Mary | wine |
| 1 | Joseph | beer |
pd.merge(df6, df7, how='outer')
| name | food | drink | |
|---|---|---|---|
| 0 | Peter | fish | NaN |
| 1 | Paul | beans | NaN |
| 2 | Mary | bread | wine |
| 3 | Joseph | NaN | beer |
左连接与右链接
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
df6
| name | food | |
|---|---|---|
| 0 | Peter | fish |
| 1 | Paul | beans |
| 2 | Mary | bread |
df7
| name | drink | |
|---|---|---|
| 0 | Mary | wine |
| 1 | Joseph | beer |
pd.merge(df6, df7, how='left')
| name | food | drink | |
|---|---|---|---|
| 0 | Peter | fish | NaN |
| 1 | Paul | beans | NaN |
| 2 | Mary | bread | wine |
重复列名 suffixes 参数¶
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
df8
| name | rank | |
|---|---|---|
| 0 | Bob | 1 |
| 1 | Jake | 2 |
| 2 | Lisa | 3 |
| 3 | Sue | 4 |
df9
| name | rank | |
|---|---|---|
| 0 | Bob | 3 |
| 1 | Jake | 1 |
| 2 | Lisa | 4 |
| 3 | Sue | 2 |
pd.merge(df8, df9, on="name")
| name | rank_x | rank_y | |
|---|---|---|---|
| 0 | Bob | 1 | 3 |
| 1 | Jake | 2 | 1 |
| 2 | Lisa | 3 | 4 |
| 3 | Sue | 4 | 2 |
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')
df8
| name | rank | |
|---|---|---|
| 0 | Bob | 1 |
| 1 | Jake | 2 |
| 2 | Lisa | 3 |
| 3 | Sue | 4 |
df9
| name | rank | |
|---|---|---|
| 0 | Bob | 3 |
| 1 | Jake | 1 |
| 2 | Lisa | 4 |
| 3 | Sue | 2 |
pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])
| name | rank_L | rank_R | |
|---|---|---|---|
| 0 | Bob | 1 | 3 |
| 1 | Jake | 2 | 1 |
| 2 | Lisa | 3 | 4 |
| 3 | Sue | 4 | 2 |
案例: 美国各州数据¶
不同数据源得到的数据需要通过合并方法,整合在一起。在本例训练中,我们从三个数据源进行汇总处理。 文件可以在下面网址中下载: http://github.com/jakevdp/data-USstates/
# Following are shell commands to download the data
#!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv
将数据下载存储到data文件夹中。使用pd.read_csv()调用。
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv')
abbrevs = pd.read_csv('data/state-abbrevs.csv')
display('pop.head()', 'areas.head()', 'abbrevs.head()')
pop.head()
| state/region | ages | year | population | |
|---|---|---|---|---|
| 0 | AL | under18 | 2012 | 1117489.0 |
| 1 | AL | total | 2012 | 4817528.0 |
| 2 | AL | under18 | 2010 | 1130966.0 |
| 3 | AL | total | 2010 | 4785570.0 |
| 4 | AL | under18 | 2011 | 1125763.0 |
areas.head()
| state | area (sq. mi) | |
|---|---|---|
| 0 | Alabama | 52423 |
| 1 | Alaska | 656425 |
| 2 | Arizona | 114006 |
| 3 | Arkansas | 53182 |
| 4 | California | 163707 |
abbrevs.head()
| state | abbreviation | |
|---|---|---|
| 0 | Alabama | AL |
| 1 | Alaska | AK |
| 2 | Arizona | AZ |
| 3 | Arkansas | AR |
| 4 | California | CA |
如果我们希望得到一个简单指标:美国各个州人口密度的数据与排名,我们尝试将这几张表结合起来,然后形成一整张数据DataFrame,就可以直接通过计算得到结果。
merged = pd.merge(pop, abbrevs, how='outer',
left_on='state/region', right_on='abbreviation')
merged
| state/region | ages | year | population | state | abbreviation | |
|---|---|---|---|---|---|---|
| 0 | AK | total | 1990 | 553290.0 | Alaska | AK |
| 1 | AK | under18 | 1990 | 177502.0 | Alaska | AK |
| 2 | AK | total | 1992 | 588736.0 | Alaska | AK |
| 3 | AK | under18 | 1991 | 182180.0 | Alaska | AK |
| 4 | AK | under18 | 1992 | 184878.0 | Alaska | AK |
| ... | ... | ... | ... | ... | ... | ... |
| 2539 | WY | under18 | 1993 | 137458.0 | Wyoming | WY |
| 2540 | WY | total | 1991 | 459260.0 | Wyoming | WY |
| 2541 | WY | under18 | 1991 | 136720.0 | Wyoming | WY |
| 2542 | WY | under18 | 1990 | 136078.0 | Wyoming | WY |
| 2543 | WY | total | 1990 | 453690.0 | Wyoming | WY |
2544 rows × 6 columns
merged = merged.drop('abbreviation',axis=1) # drop duplicate info
merged.head()
| state/region | ages | year | population | state | |
|---|---|---|---|---|---|
| 0 | AK | total | 1990 | 553290.0 | Alaska |
| 1 | AK | under18 | 1990 | 177502.0 | Alaska |
| 2 | AK | total | 1992 | 588736.0 | Alaska |
| 3 | AK | under18 | 1991 | 182180.0 | Alaska |
| 4 | AK | under18 | 1992 | 184878.0 | Alaska |
merged.head()
| state/region | ages | year | population | state | |
|---|---|---|---|---|---|
| 0 | AK | total | 1990 | 553290.0 | Alaska |
| 1 | AK | under18 | 1990 | 177502.0 | Alaska |
| 2 | AK | total | 1992 | 588736.0 | Alaska |
| 3 | AK | under18 | 1991 | 182180.0 | Alaska |
| 4 | AK | under18 | 1992 | 184878.0 | Alaska |
检查下数据是否存在缺失。
merged.isnull().any()
state/region False ages False year False population True state True dtype: bool
部分 population 数据是空的。另外关于美国州的变量state也出现了缺失值。为此,可以通过以下方法仔细查看具体细节:
merged[merged['population'].isnull()].head()
| state/region | ages | year | population | state | |
|---|---|---|---|---|---|
| 1872 | PR | under18 | 1990 | NaN | NaN |
| 1873 | PR | total | 1990 | NaN | NaN |
| 1874 | PR | total | 1991 | NaN | NaN |
| 1875 | PR | under18 | 1991 | NaN | NaN |
| 1876 | PR | total | 1993 | NaN | NaN |
好像所有数据缺失都出现在2000之前的波多黎各。估计是此前没有统计波多黎各人口。我们可以再来看看关于州的数据缺失问题:
merged.loc[merged['state'].isnull(), 'state/region'].unique()
array(['PR', 'USA'], dtype=object)
这个问题可以快速解决:由于人口数据中包含波多黎各(PR)和全国总数(USA),但是这两项没有出现在州名缩写表中。可以使用快速充填方法补齐缺失,方法如下:
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()
state/region False ages False year False population True state False dtype: bool
这样,在 state 列中已经没有缺失值了。
我们使用类似的方法,将面积数据合并过来:
final = pd.merge(merged, areas, on='state', how='left')
final.head()
| state/region | ages | year | population | state | area (sq. mi) | |
|---|---|---|---|---|---|---|
| 0 | AK | total | 1990 | 553290.0 | Alaska | 656425.0 |
| 1 | AK | under18 | 1990 | 177502.0 | Alaska | 656425.0 |
| 2 | AK | total | 1992 | 588736.0 | Alaska | 656425.0 |
| 3 | AK | under18 | 1991 | 182180.0 | Alaska | 656425.0 |
| 4 | AK | under18 | 1992 | 184878.0 | Alaska | 656425.0 |
同样,我们也需要检查一下数据,看看缺失值问题,还有哪些缺失值没有匹配:
final.isnull().any()
state/region False ages False year False population True state False area (sq. mi) True dtype: bool
可看到 area 列也存在缺失值,通过下面方法查找:
final['state'][final['area (sq. mi)'].isnull()].unique()
array(['United States'], dtype=object)
我们看到 areas DataFrame ,不包含全美国的面积数据。本例中不需要,去掉这个值。方法如下:
final.dropna(inplace=True)
final.head()
| state/region | ages | year | population | state | area (sq. mi) | |
|---|---|---|---|---|---|---|
| 0 | AK | total | 1990 | 553290.0 | Alaska | 656425.0 |
| 1 | AK | under18 | 1990 | 177502.0 | Alaska | 656425.0 |
| 2 | AK | total | 1992 | 588736.0 | Alaska | 656425.0 |
| 3 | AK | under18 | 1991 | 182180.0 | Alaska | 656425.0 |
| 4 | AK | under18 | 1992 | 184878.0 | Alaska | 656425.0 |
现在我们得到了所需要的数据表。为了实现结果,先选择2010年各州人口和总人口数据,使用query()函数快速查询。
data2010 = final.query("year == 2010 & ages == 'total'")
data2010.head()
| state/region | ages | year | population | state | area (sq. mi) | |
|---|---|---|---|---|---|---|
| 43 | AK | total | 2010 | 713868.0 | Alaska | 656425.0 |
| 51 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
| 141 | AR | total | 2010 | 2922280.0 | Arkansas | 53182.0 |
| 149 | AZ | total | 2010 | 6408790.0 | Arizona | 114006.0 |
| 197 | CA | total | 2010 | 37333601.0 | California | 163707.0 |
现在计算人口密度并按照顺序排列。首先先对索引进行重置,然后计算结果:
data2010.set_index('state', inplace=True)
density = data2010['population'] / data2010['area (sq. mi)']
density.sort_values(ascending=False, inplace=True)
density.head()
state District of Columbia 8898.897059 Puerto Rico 1058.665149 New Jersey 1009.253268 Rhode Island 681.339159 Connecticut 645.600649 dtype: float64
我们可以看到,华盛顿特区最高。同样可以查看人口密度最小的州:
density.tail()
state South Dakota 10.583512 North Dakota 9.537565 Montana 6.736171 Wyoming 5.768079 Alaska 1.087509 dtype: float64
很明显,阿拉斯加是2010年美国人口密度最低的州,刚刚超过1万人/平方英里。
在现实数据处理,合并这种脏乱的数据是非常常见的任务,在实际处理中需要认真了解数据情况,结合项目目标,仔细分析处理。