Combining Datasets: Concat and Append

This section covers how to combine different datasets in pandas.

In [2]:

import pandas as pd
import numpy as np

The following helper function builds the small DataFrames used throughout this section.

In [4]:

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))

Out[4]:

	A	B	C
0	A0	B0	C0
1	A1	B1	C1
2	A2	B2	C2

In [22]:

# Display helper: shows several objects side by side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

Recall: Concatenation of NumPy Arrays

In [4]:

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])

Out[4]:

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

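The first argument is a list or tuple of the arrays to concatenate. The optional axis keyword specifies the axis along which they are joined.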

In [12]:

x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)

Out[12]:

array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
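
To make the effect of axis concrete, the short sketch below concatenates the same 2x2 array along both axes and compares the resulting shapes. The names grid, stacked_rows, and stacked_cols are illustrative and do not appear in the original notebook.

```python
import numpy as np

grid = np.array([[1, 2],
                 [3, 4]])

# axis=0 joins along the first axis: the rows of the second array
# are placed below the rows of the first.
stacked_rows = np.concatenate([grid, grid], axis=0)

# axis=1 joins along the second axis: the columns of the second array
# are placed to the right of the columns of the first.
stacked_cols = np.concatenate([grid, grid], axis=1)

print(stacked_rows.shape)  # (4, 2)
print(stacked_cols.shape)  # (2, 4)
```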

Simple Concatenation with `pd.concat`

Pandas provides the pd.concat() function, which works much like np.concatenate but takes many more options:

# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

In [17]:

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

Out[17]:

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

It also works on higher-dimensional objects such as DataFrames:

In [20]:

df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

Out[20]:

df1

	A	B
1	A1	B1
2	A2	B2

df2

	A	B
3	A3	B3
4	A4	B4

pd.concat([df1, df2])

	A	B
1	A1	B1
2	A2	B2
3	A3	B3
4	A4	B4

By default, pd.concat works row-wise (axis=0). As with np.concatenate, the axis argument sets the direction of concatenation:

In [27]:

df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")

Out[27]:

df3

	A	B
0	A0	B0
1	A1	B1

df4

	C	D
0	C0	D0
1	C1	D1

pd.concat([df3, df4], axis=1)

	A	B	C	D
0	A0	B0	C0	D0
1	A1	B1	C1	D1
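
A small sketch of two details that the column-wise case tends to hide: axis also accepts the string 'columns', and rows are matched by index label rather than by position. The frames left, right, and shifted are made up for illustration and are built directly so the block runs on its own.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
right = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])

# axis='columns' is the string spelling of axis=1.
wide = pd.concat([left, right], axis='columns')
print(wide)

# Rows are matched by index label, not by position: a label present in
# only one input gets NaN in the columns that come from the other input.
shifted = pd.DataFrame({'C': ['C1', 'C2'], 'D': ['D1', 'D2']}, index=[1, 2])
aligned = pd.concat([left, shifted], axis=1)
print(aligned)
```

Here aligned has three rows (labels 0, 1, 2), and only label 1 has values in all four columns.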

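Duplicate Indices

One important difference between np.concatenate and pd.concat is that pandas concatenation preserves the indices, so the result may contain duplicate index labels: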

In [30]:

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')

Out[30]:

x

	A	B
0	A0	B0
1	A1	B1

y

	A	B
0	A2	B2
1	A3	B3

pd.concat([x, y])

	A	B
0	A0	B0
1	A1	B1
0	A2	B2
1	A3	B3
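
One way to see why repeated labels can be troublesome is to inspect the combined index and try a label lookup. This is a minimal sketch that rebuilds x and y directly, without the make_df helper, so it runs on its own. The name combined is illustrative.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

combined = pd.concat([x, y])

# is_unique is False as soon as any label appears more than once.
print(combined.index.is_unique)

# duplicated() flags every repeat after the first occurrence of a label.
print(combined.index.duplicated().tolist())

# A label lookup now returns every matching row instead of a single row.
print(combined.loc[0])
```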

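Duplicate index labels are valid in a DataFrame, but the result is often not what we want. pd.concat() offers several ways to handle this.

Catching Duplicate Indices as an Error

To check whether the concatenated result contains duplicate indices, set the verify_integrity flag. With verify_integrity=True, the concatenation raises an exception if any index labels overlap: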

In [34]:

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

ValueError: Indexes have overlapping values: Index([0, 1], dtype='int64')

Ignoring the Index

Sometimes the index labels themselves do not matter. In that case, pass ignore_index=True and the result gets a new integer index:

In [37]:

display('x', 'y', 'pd.concat([x, y], ignore_index=True)')

Out[37]:

x

	A	B
0	A0	B0
1	A1	B1

y

	A	B
0	A2	B2
1	A3	B3

pd.concat([x, y], ignore_index=True)

	A	B
0	A0	B0
1	A1	B1
2	A2	B2
3	A3	B3
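
For comparison, the same result can be reached in two steps by concatenating first and then dropping the old index with reset_index(drop=True). The sketch below checks that both routes agree. The names fresh and two_step are illustrative.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# One step: discard the input indices while concatenating.
fresh = pd.concat([x, y], ignore_index=True)

# Two steps: concatenate, then replace the index with 0..n-1.
two_step = pd.concat([x, y]).reset_index(drop=True)

print(list(fresh.index))       # [0, 1, 2, 3]
print(fresh.equals(two_step))  # True
```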

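Adding MultiIndex Keys

Another option is the keys argument of pd.concat, which labels each data source and produces a hierarchical (multi-level) index: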

In [40]:

display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")

Out[40]:

x

	A	B
0	A0	B0
1	A1	B1

y

	A	B
0	A2	B2
1	A3	B3

pd.concat([x, y], keys=['x', 'y'])

		A	B
x	0	A0	B0
x	1	A1	B1
y	0	A2	B2
y	1	A3	B3

The result is a multiply indexed DataFrame, and we can use the tools discussed in Hierarchical Indexing to transform this data into the representation we're interested in.
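
As a sketch of what the hierarchical index makes possible, the example below also passes names to label the two index levels, then selects one source and flattens the outer level into a column. The level names 'source' and 'row' are illustrative choices, not part of the original notebook.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# keys labels each input; names labels the resulting index levels.
keyed = pd.concat([x, y], keys=['x', 'y'], names=['source', 'row'])
print(keyed.index.names)

# Select every row that came from y by its outer-level key.
from_y = keyed.loc['y']
print(from_y)

# Move the outer level into an ordinary column.
flat = keyed.reset_index(level='source')
print(flat)
```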

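Union and Intersection Joins

Things change when the DataFrames do not share the same column names. By default the result takes the union of the input columns (join='outer'), and entries with no data are filled with NaN: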

In [44]:

df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')

Out[44]:

df5

	A	B	C
1	A1	B1	C1
2	A2	B2	C2

df6

	B	C	D
3	B3	C3	D3
4	B4	C4	D4

pd.concat([df5, df6])

	A	B	C	D
1	A1	B1	C1	NaN
2	A2	B2	C2	NaN
3	NaN	B3	C3	D3
4	NaN	B4	C4	D4

Use join='inner' for an intersection join, which keeps only the columns shared by all inputs:

In [47]:

display('df5', 'df6',
        "pd.concat([df5, df6], join='inner')")

Out[47]:

df5

	A	B	C
1	A1	B1	C1
2	A2	B2	C2

df6

	B	C	D
3	B3	C3	D3
4	B4	C4	D4

pd.concat([df5, df6], join='inner')

	B	C
1	B1	C1
2	B2	C2
3	B3	C3
4	B4	C4
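
The two join modes can be checked against plain set operations on the column labels. The sketch below rebuilds df5 and df6 directly and compares the resulting columns. The names outer, inner, and shared are illustrative.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Default join='outer': union of the columns, gaps filled with NaN.
outer = pd.concat([df5, df6])
print(list(outer.columns))           # ['A', 'B', 'C', 'D']
print(int(outer.isna().sum().sum())) # 4 missing entries

# join='inner': only the columns present in every input survive.
inner = pd.concat([df5, df6], join='inner')
shared = df5.columns.intersection(df6.columns)
print(list(inner.columns) == list(shared))  # True
```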

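Another option was the join_axes argument, which took a list of index objects and forced the result to use the row or column labels of one of the inputs. join_axes was deprecated in pandas 0.25 and removed in 1.0, so the following call only works in older versions: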

display('df5', 'df6',
        "pd.concat([df5, df6], join_axes=[df5.columns])")

In [60]:

display('df5', 'df6',
        "pd.concat([df5, df6], names=[df5.columns])")

Out[60]:

df5

	A	B	C
1	A1	B1	C1
2	A2	B2	C2

df6

	B	C	D
3	B3	C3	D3
4	B4	C4	D4

pd.concat([df5, df6], names=[df5.columns])

	A	B	C	D
1	A1	B1	C1	NaN
2	A2	B2	C2	NaN
3	NaN	B3	C3	D3
4	NaN	B4	C4	D4
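
Note that names only labels the levels of a hierarchical index created with keys, so the output above is identical to the default outer join and the columns are not restricted to those of df5. One way to get the effect that join_axes=[df5.columns] used to provide in current pandas is to reindex the concatenated result. This is a sketch of that substitute, not the approach used in the cell above. The name restricted is illustrative.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Concatenate with the default outer join, then keep only df5's columns.
restricted = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(restricted)
```

The result keeps columns A, B, and C. Column D is dropped, and the rows that came from df6 hold NaN in column A.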

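The `append()` Method

DataFrame.append() and Series.append() were deprecated in pandas 1.4 and removed in 2.0; use pd.concat() instead. Calling append() in pandas 2.x raises an AttributeError: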

In [64]:

display('df1', 'df2', 'df1.append(df2)')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
...
AttributeError: 'DataFrame' object has no attribute 'append'
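
On pandas 2.x, the same results can be produced with pd.concat. The sketch below shows replacements for the two most common append() patterns. The dictionary row and the names appended, new_row, and with_row are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# Older code: df1.append(df2)
appended = pd.concat([df1, df2])
print(appended)

# Older code: df1.append({'A': 'A5', 'B': 'B5'}, ignore_index=True)
# Wrap the single row in a one-row DataFrame first.
new_row = pd.DataFrame([{'A': 'A5', 'B': 'B5'}])
with_row = pd.concat([df1, new_row], ignore_index=True)
print(with_row)
```

pd.concat returns a new object and copies the data each time, so when many pieces need to be combined it is usually cheaper to collect them in a list and call pd.concat once than to concatenate inside a loop.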