Combining Datasets: Concat and Append
Data often comes from several sources that need to be combined. This section covers how to do that in pandas.
import pandas as pd
import numpy as np
The following helper function builds a small DataFrame that is reused in the examples below.
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)
# example DataFrame
make_df('ABC', range(3))
| | A | B | C |
|---|---|---|---|
| 0 | A0 | B0 | C0 |
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
# Helper class that displays several objects side by side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
Recall: Concatenation of NumPy Arrays
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
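The first argument is a list or tuple of arrays to concatenate. The optional axis keyword specifies the axis along which the arrays are joined.
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])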
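To make the effect of the axis keyword concrete, here is a minimal sketch that joins the same 2x2 array along both axes and compares the resulting shapes. The variable names (grid, stacked_rows, stacked_cols) are illustrative and not part of the original text.

```python
import numpy as np

grid = np.array([[1, 2],
                 [3, 4]])

# axis=0 (the default) appends rows: the arrays are stacked vertically.
stacked_rows = np.concatenate([grid, grid], axis=0)

# axis=1 appends columns: the arrays are placed side by side.
stacked_cols = np.concatenate([grid, grid], axis=1)

print(stacked_rows.shape)  # (4, 2)
print(stacked_cols.shape)  # (2, 4)
```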
Simple Concatenation with pd.concat
Pandas has a function, pd.concat(), which is similar to np.concatenate but offers a number of additional options.
# Signature in Pandas v0.18 (join_axes was removed in pandas 1.0)
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
It also works for higher-dimensional objects such as DataFrames:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
df1
| | A | B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
df2
| | A | B |
|---|---|---|
| 3 | A3 | B3 |
| 4 | A4 | B4 |
pd.concat([df1, df2])
| | A | B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |
By default, pd.concat works row-wise (axis=0). Like np.concatenate, pd.concat accepts an axis argument that sets the direction of concatenation.
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")
df3
| | A | B |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
df4
| | C | D |
|---|---|---|
| 0 | C0 | D0 |
| 1 | C1 | D1 |
pd.concat([df3, df4], axis=1)
| | A | B | C | D |
|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
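When concatenating along axis=1, rows are matched by index label, not by position. The sketch below uses two hypothetical frames (left and right, not from the original text) whose indices only partly overlap, and also shows that axis='columns' is accepted as an equivalent spelling of axis=1.

```python
import pandas as pd

# Hypothetical frames whose row labels overlap only at label 1.
left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'C': ['C1', 'C2']}, index=[1, 2])

# Rows are aligned on the index; labels missing from one input become NaN.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# axis='columns' is an alias for axis=1.
same_result = pd.concat([left, right], axis='columns')
```

Only the row labeled 1 has values in both columns, because it is the only label the two inputs share.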
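Duplicate Indices
One important difference between np.concatenate and pd.concat is that pandas concatenation preserves indices, even if the result contains duplicate index labels.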
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
x
| | A | B |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
y
| | A | B |
|---|---|---|
| 0 | A2 | B2 |
| 1 | A3 | B3 |
pd.concat([x, y])
| | A | B |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 0 | A2 | B2 |
| 1 | A3 | B3 |
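Although repeated index labels are valid in a DataFrame, the outcome is often not what we want. pd.concat() offers several ways to handle it.
Catching Repeated Indices as an Error
To check that the concatenated result has no duplicate index labels, set the verify_integrity flag. With verify_integrity=True, the concatenation raises an exception if any index labels overlap.
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)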
ValueError: Indexes have overlapping values: Index([0, 1], dtype='int64')
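If raising an exception is too strict, the duplicates can also be inspected after the fact. This is a minimal sketch, assuming the same x and y as above (rebuilt here without make_df so the block is self-contained); it relies on the standard Index.is_unique attribute and Index.duplicated() method.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

combined = pd.concat([x, y])

# is_unique is False as soon as any label occurs more than once.
print(combined.index.is_unique)  # False

# duplicated() marks every repeat of a label after its first occurrence.
dupes = combined.index[combined.index.duplicated()].unique()
print(list(dupes))  # [0, 1]
```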
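Ignoring the Index
Sometimes the index itself does not matter. In that case, set the ignore_index flag: with ignore_index=True, the concatenation discards the original labels and creates a new integer index for the result.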
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
x
| | A | B |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
y
| | A | B |
|---|---|---|
| 0 | A2 | B2 |
| 1 | A3 | B3 |
pd.concat([x, y], ignore_index=True)
| | A | B |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
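For comparison, the same renumbering can be obtained by concatenating first and then calling reset_index(drop=True). The sketch below rebuilds x and y without make_df so that it runs on its own; it is meant only to illustrate what ignore_index does to the row labels.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# Discard the original labels while concatenating.
renumbered = pd.concat([x, y], ignore_index=True)

# Equivalent two-step form: concatenate, then drop the old index.
two_step = pd.concat([x, y]).reset_index(drop=True)

print(list(renumbered.index))  # [0, 1, 2, 3]
```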
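Adding MultiIndex Keys
Another option is to use the keys argument of pd.concat to label each data source. The result is a hierarchically indexed DataFrame in which the outer level identifies the source.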
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")
x
| | A | B |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
y
| | A | B |
|---|---|---|
| 0 | A2 | B2 |
| 1 | A3 | B3 |
pd.concat([x, y], keys=['x', 'y'])
| | | A | B |
|---|---|---|---|
| x | 0 | A0 | B0 |
| | 1 | A1 | B1 |
| y | 0 | A2 | B2 |
| | 1 | A3 | B3 |
The result is a multiply indexed DataFrame, and we can use the tools discussed in Hierarchical Indexing to transform this data into the representation we're interested in.
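As a rough illustration of working with such a result, the sketch below selects data by source key and flattens the hierarchy back into columns. The level names 'source' and 'row' are made up for this example; the names argument labels the index levels created by keys.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# keys adds an outer index level; names labels both levels (names are illustrative).
combined = pd.concat([x, y], keys=['x', 'y'], names=['source', 'row'])

# All rows that came from y.
from_y = combined.loc['y']

# A single cell, addressed by (source, row) and column.
cell = combined.loc[('x', 1), 'B']
print(cell)  # B1

# Turn both index levels into ordinary columns.
flat = combined.reset_index()
print(list(flat.columns))  # ['source', 'row', 'A', 'B']
```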
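Concatenation with Union and Intersection Joins
When the DataFrames do not share the same column names, the situation changes. By default, the join is a union of the input columns (join='outer'), and entries for which no data is available are filled with NaN: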
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
df5
| | A | B | C |
|---|---|---|---|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df6
| | B | C | D |
|---|---|---|---|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |
pd.concat([df5, df6])
| | A | B | C | D |
|---|---|---|---|---|
| 1 | A1 | B1 | C1 | NaN |
| 2 | A2 | B2 | C2 | NaN |
| 3 | NaN | B3 | C3 | D3 |
| 4 | NaN | B4 | C4 | D4 |
To keep only the intersection of the columns, pass join='inner'. Only the columns present in every input are retained:
display('df5', 'df6',
        "pd.concat([df5, df6], join='inner')")
df5
| | A | B | C |
|---|---|---|---|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df6
| | B | C | D |
|---|---|---|---|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |
pd.concat([df5, df6], join='inner')
| | B | C |
|---|---|---|
| 1 | B1 | C1 |
| 2 | B2 | C2 |
| 3 | B3 | C3 |
| 4 | B4 | C4 |
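One way to see what join='inner' does is to compute the shared columns by hand with Index.intersection and concatenate only those. This is a sketch for illustration, with df5 and df6 rebuilt inline so the block is self-contained; the names shared and manual are not from the original text.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Built-in intersection join.
inner = pd.concat([df5, df6], join='inner')

# Manual equivalent: restrict both inputs to the columns they share.
shared = df5.columns.intersection(df6.columns)
manual = pd.concat([df5[shared], df6[shared]])

print(list(inner.columns))  # ['B', 'C']
```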
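Another option is to specify directly which columns (or index labels) the result should keep, taking them from one of the inputs. Older versions of pandas supported this through the join_axes argument, which takes a list of index objects.
join_axes was deprecated in pandas 0.25 and removed in pandas 1.0, so the following call only runs on older versions:
display('df5', 'df6',
        "pd.concat([df5, df6], join_axes=[df5.columns])")
Passing names=[df5.columns] is not a substitute. The names argument only labels the levels of a hierarchical index built with keys, so as the output below shows, the result is the same as the default outer join:
display('df5', 'df6',
        "pd.concat([df5, df6], names=[df5.columns])")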
df5
| | A | B | C |
|---|---|---|---|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df6
| | B | C | D |
|---|---|---|---|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |
pd.concat([df5, df6], names=[df5.columns])
| | A | B | C | D |
|---|---|---|---|---|
| 1 | A1 | B1 | C1 | NaN |
| 2 | A2 | B2 | C2 | NaN |
| 3 | NaN | B3 | C3 | D3 |
| 4 | NaN | B4 | C4 | D4 |
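On current pandas versions, a common way to get the effect that join_axes=[df5.columns] used to provide is to concatenate and then call reindex with the desired columns. The sketch below is one such workaround, not the only one; df5 and df6 are rebuilt inline so the block runs on its own.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Outer-join concat, then keep only the columns of df5.
# Column D is dropped; column A is NaN for the rows that came from df6.
result = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(result)
```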
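The append() Method
DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions the call below raises an AttributeError; use pd.concat([df1, df2]) instead.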
display('df1', 'df2', 'df1.append(df2)')
AttributeError: 'DataFrame' object has no attribute 'append'
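A minimal sketch of the replacement for df1.append(df2) on pandas 2.x follows, with df1 and df2 rebuilt inline. The second half uses a hypothetical list called pieces to show the usual pattern for combining many frames: collect them in a list and call pd.concat once, since every concatenation builds a new object.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# Replacement for the removed df1.append(df2).
appended = pd.concat([df1, df2])

# Hypothetical extra pieces: gather them in a list, then concatenate once.
pieces = [pd.DataFrame({'A': [f'A{i}'], 'B': [f'B{i}']}, index=[i])
          for i in range(5, 8)]
combined = pd.concat([appended, *pieces])

print(list(combined.index))  # [1, 2, 3, 4, 5, 6, 7]
```

Like the old append() method, pd.concat returns a new object and leaves df1 and df2 unchanged.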