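Introduction to Pandas' Basic Objects
Pandas has three fundamental data structures: Series, DataFrame, and Index. The cells below briefly walk through the main features of each.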
In [3]:
import numpy as np
import pandas as pd
The Pandas Series object
In [6]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
Out[6]:
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
In [8]:
data.values
Out[8]:
array([0.25, 0.5 , 0.75, 1. ])
In [10]:
data.index
Out[10]:
RangeIndex(start=0, stop=4, step=1)
In [12]:
data[1]
Out[12]:
0.5
In [14]:
data[1:3]
Out[14]:
1    0.50
2    0.75
dtype: float64
In [17]:
# Specify the index labels explicitly alongside the values
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
Out[17]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
In [21]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data
Out[21]:
2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
In [23]:
data[5]
Out[23]:
0.5
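With an integer index, plain square brackets look values up by label, which is easy to confuse with positional access. A minimal sketch of the difference, using the explicit `loc` (label) and `iloc` (position) accessors on the same Series as above; the variable names are only illustrative:

```python
import pandas as pd

# Same Series as above: integer labels that are not in positional order
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[2]      # the element whose label is 2 (the first one)
by_position = data.iloc[2]  # the third element, whose label is 3

print(by_label, by_position)
```

Using `loc` and `iloc` makes the intent explicit, so the code does not depend on how bare brackets treat integer keys.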
In [25]:
# Switch back to the original string index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
Out[25]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
In [27]:
data['b']
Out[27]:
0.5
In [29]:
'a' in data
Out[29]:
True
In [31]:
data.keys()
Out[31]:
Index(['a', 'b', 'c', 'd'], dtype='object')
In [13]:
list(data.items())
Out[13]:
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
In [33]:
# Series elements can be assigned directly; assigning to a new label extends the Series
data['e'] = 1.25
data
Out[33]:
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
In [35]:
# A pd.Series can be viewed as a specialized dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population
Out[35]:
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
In [37]:
population['California']
Out[37]:
38332521
In [39]:
population['California':'Illinois']
Out[39]:
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
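Unlike a plain dictionary, a Series supports slicing, and a slice by label includes its end point, whereas a slice by position excludes it. The sketch below rebuilds the same `population` Series and compares the two; `label_slice` and `pos_slice` are illustrative names:

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: the end label 'New York' is included
label_slice = population.loc['California':'New York']

# Position-based slice: the end position 2 is excluded
pos_slice = population.iloc[0:2]

print(len(label_slice), len(pos_slice))
```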
Ways to construct a pd.Series
In [42]:
pd.Series([2, 4, 6])
Out[42]:
0    2
1    4
2    6
dtype: int64
In [44]:
pd.Series(5, index=[100, 200, 300])
Out[44]:
100    5
200    5
300    5
dtype: int64
In [46]:
pd.Series({2:'a', 1:'b', 3:'c'})
Out[46]:
2    a
1    b
3    c
dtype: object
In [48]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
Out[48]:
3    c
2    a
dtype: object
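When a dictionary is combined with an explicit `index`, only the listed keys are kept, in the order given. A small sketch of the complementary case, assuming an index label (here `4`, chosen only for illustration) that is absent from the dictionary:

```python
import pandas as pd

# Label 4 has no entry in the dictionary, so its value becomes NaN
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s)
print(s.isna().tolist())
```

The explicit index therefore acts as both a filter and a reindexing step: dictionary keys left out of the index are dropped, and index labels missing from the dictionary are filled with NaN.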
The Pandas DataFrame object
In [51]:
# Create a second Series: area
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
Out[51]:
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64
In [53]:
# Combine the population Series from above with the area Series just created
states = pd.DataFrame({'population': population,
                       'area': area})
states
Out[53]:
| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |
In [55]:
states.index
Out[55]:
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
In [57]:
states.columns
Out[57]:
Index(['population', 'area'], dtype='object')
In [59]:
# Selecting a single column returns a Series
states['population']
Out[59]:
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
In [61]:
states.population
Out[61]:
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
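The two forms return the same column here, but they suit different situations: attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method, while bracket access works for any column name.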
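To make the difference between the two access styles concrete, here is a small sketch with a hypothetical column named `pop`, which collides with the `DataFrame.pop` method. The frame and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical frame: the column name 'pop' clashes with DataFrame.pop
df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]})

area_attr = df.area   # fine: 'area' does not clash with any attribute
pop_col = df['pop']   # bracket access always returns the column
pop_attr = df.pop     # attribute access returns the method, not the column

print(type(pop_col).__name__, callable(pop_attr))
```

Bracket access is the safer default, particularly when assigning to a column.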
Ways to construct a DataFrame in Pandas
In [65]:
# Build from a single Series
pd.DataFrame(population, columns=['population'])
Out[65]:
| | population |
|---|---|
| California | 38332521 |
| Texas | 26448193 |
| New York | 19651127 |
| Florida | 19552860 |
| Illinois | 12882135 |
In [69]:
population  # a Series and a DataFrame are different data structures
Out[69]:
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
In [71]:
# Build from a list of dictionaries
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)
Out[71]:
| | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |
In [73]:
# Missing keys do not prevent construction; the gaps are filled with NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
Out[73]:
| | a | b | c |
|---|---|---|---|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |
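The output above shows `1.0` and `4.0` rather than `1` and `4`: a column that receives a NaN is stored as floating point, while the complete column `b` keeps an integer dtype. A brief sketch that inspects this, using the same input:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns 'a' and 'c' each contain one NaN and are stored as floats;
# column 'b' is complete and stays integer
print(df.dtypes)

# Count the missing entries per column
missing = df.isna().sum()
print(missing)
```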
In [75]:
# Build from a dictionary of Series
pd.DataFrame({'population': population,
              'area': area})
Out[75]:
| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |
In [77]:
# Build from a two-dimensional NumPy array (values are random, so they differ on every run)
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
Out[77]:
| | foo | bar |
|---|---|---|
| a | 0.321930 | 0.391245 |
| b | 0.228380 | 0.512090 |
| c | 0.452195 | 0.533674 |
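The Pandas Index object
Both Series and DataFrame carry an Index that labels their entries. The Index is an important supporting object in Pandas and is worth understanding in its own right.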
In [81]:
ind = pd.Index([2, 3, 5, 7, 11])
ind
Out[81]:
Index([2, 3, 5, 7, 11], dtype='int64')
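An Index can also be read like an immutable array: it supports indexing and slicing, but its entries cannot be modified in place. A minimal sketch of both behaviours; the names `second`, `every_other`, and `raised` are illustrative:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

second = ind[1]         # a single element
every_other = ind[::2]  # slicing returns another Index

# In-place modification is rejected with a TypeError
raised = False
try:
    ind[1] = 0
except TypeError:
    raised = True

print(second, list(every_other), raised)
```

This immutability is what makes it safe to share one Index between several Series and DataFrame objects.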
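An Index also behaves like an ordered set, so it supports set operations such as intersection, union, and symmetric difference.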
In [94]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
In [96]:
indA.intersection(indB)  # intersection; in pandas 2.x, & on an Index is an element-wise bitwise AND instead
Out[96]:
Index([3, 5, 7], dtype='int64')
In [88]:
indA.union(indB)  # union; in pandas 2.x, | on an Index is an element-wise bitwise OR instead
Out[88]:
Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
In [92]:
indA.symmetric_difference(indB)  # symmetric difference; in pandas 2.x, ^ on an Index is an element-wise bitwise XOR instead
Out[92]:
Index([1, 2, 9, 11], dtype='int64')