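Visualization with pandas, Matplotlib, and Seaborn
Although the core of pandas is array-style data processing, the library also includes plotting functionality that can draw visualizations of your data when the requirements are not too complex.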
In [4]:
import pandas as pd
import numpy as np
import pyarrow as pa
Introduction
In [6]:
import matplotlib.pyplot as plt
plt.ion();
Creating charts from aggregated data
How to do it
In [8]:
ser = pd.Series(
    (x ** 2 for x in range(7)),
    name="book_sales",
    index=(f"Day {x + 1}" for x in range(7)),
    dtype=pd.Int64Dtype(),
)
ser
Out[8]:
Day 1     0
Day 2     1
Day 3     4
Day 4     9
Day 5    16
Day 6    25
Day 7    36
Name: book_sales, dtype: Int64
In [10]:
ser.plot();
In [12]:
ser.plot(kind="bar");
In [14]:
ser.plot(kind="barh");
In [16]:
ser.plot(kind="area");
In [20]:
ser.plot(kind="pie");
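As a hedged aside, the `kind=` argument used in the cells above also has an accessor-method spelling in pandas, so `plot(kind="bar")` and `plot.bar()` should draw the same chart. The sketch below uses a small made-up Series (`sales`) and the non-interactive `Agg` backend; both are assumptions for illustration and not part of the notebook.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, shaped like the book_sales Series in the notebook.
sales = pd.Series(
    [0, 1, 4, 9],
    index=["Day 1", "Day 2", "Day 3", "Day 4"],
    name="book_sales",
)

fig, (left, right) = plt.subplots(ncols=2)
ax_kind = sales.plot(kind="bar", ax=left)   # keyword spelling
ax_method = sales.plot.bar(ax=right)        # accessor-method spelling

# Each bar is a Rectangle patch; its height is the plotted value.
heights_kind = [float(p.get_height()) for p in ax_kind.patches]
heights_method = [float(p.get_height()) for p in ax_method.patches]
print(heights_kind, heights_method)
plt.close(fig)
```

Which spelling to use is mostly a matter of taste; the keyword form is convenient when the chart type is stored in a variable.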
In [24]:
df = pd.DataFrame({
    "book_sales": (x ** 2 for x in range(7)),
    "book_returns": [3, 2, 1, 0, 1, 2, 3],
}, index=(f"Day {x + 1}" for x in range(7)))
df = df.convert_dtypes(dtype_backend="numpy_nullable")
df
Out[24]:
| | book_sales | book_returns |
|---|---|---|
| Day 1 | 0 | 3 |
| Day 2 | 1 | 2 |
| Day 3 | 4 | 1 |
| Day 4 | 9 | 0 |
| Day 5 | 16 | 1 |
| Day 6 | 25 | 2 |
| Day 7 | 36 | 3 |
In [26]:
df.plot();
In [28]:
df.plot(kind="bar");
In [12]:
df.plot(kind="bar", stacked=True)
Out[12]:
<Axes: >
In [30]:
df.plot(kind="barh");
In [32]:
df.plot(kind="barh", stacked=True);
In [15]:
df.plot(kind="area")
Out[15]:
<Axes: >
In [16]:
df.plot(kind="area", stacked=False, alpha=0.5)
Out[16]:
<Axes: >
There's more…
In [37]:
ser.plot(
    kind="bar",
    title="Book Sales by Day",
)
Out[37]:
<Axes: title={'center': 'Book Sales by Day'}>
In [18]:
ser.plot(
    kind="bar",
    title="Book Sales by Day",
    color="seagreen",
)
Out[18]:
<Axes: title={'center': 'Book Sales by Day'}>
In [19]:
df.plot(
    kind="bar",
    title="Book Metrics",
    color={
        "book_sales": "slateblue",
        "book_returns": "#7D5260",
    }
)
Out[19]:
<Axes: title={'center': 'Book Metrics'}>
In [20]:
ser.plot(
    kind="bar",
    title="Book Sales by Day",
    color="teal",
    grid=False,
)
Out[20]:
<Axes: title={'center': 'Book Sales by Day'}>
In [21]:
ser.plot(
    kind="bar",
    title="Book Sales by Day",
    color="darkgoldenrod",
    grid=False,
    xlabel="Day Number",
    ylabel="Book Sales",
)
Out[21]:
<Axes: title={'center': 'Book Sales by Day'}, xlabel='Day Number', ylabel='Book Sales'>
In [22]:
df.plot(
    kind="bar",
    title="Book Performance",
    grid=False,
    subplots=True,
)
Out[22]:
array([<Axes: title={'center': 'book_sales'}>,
<Axes: title={'center': 'book_returns'}>], dtype=object)
In [23]:
df.plot(
    kind="bar",
    title="Book Performance",
    grid=False,
    subplots=True,
    legend=False,
)
Out[23]:
array([<Axes: title={'center': 'book_sales'}>,
<Axes: title={'center': 'book_returns'}>], dtype=object)
In [24]:
df.plot(
    kind="bar",
    title="Book Performance",
    grid=False,
    subplots=True,
    legend=False,
    sharey=True,
)
Out[24]:
array([<Axes: title={'center': 'book_sales'}>,
<Axes: title={'center': 'book_returns'}>], dtype=object)
In [25]:
df.plot(
    kind="barh",
    y=["book_returns"],
    title="Book Returns",
    legend=False,
    grid=False,
    color="seagreen",
)
Out[25]:
<Axes: title={'center': 'Book Returns'}>
Plotting distributions of non-aggregated data
How to do it
In [26]:
# Note: np.random.seed() only seeds NumPy's legacy global generator;
# default_rng() below creates an independent, unseeded Generator, so
# the sampled values differ from run to run.
np.random.seed(42)
ser = pd.Series(
    np.random.default_rng().normal(size=10_000),
    dtype=pd.Float64Dtype(),
)
ser
Out[26]:
0 -1.136009
1 -0.845098
2 1.85341
3 0.012477
4 -0.075157
...
9995 -0.324093
9996 -0.676388
9997 0.158774
9998 1.674571
9999 0.549696
Length: 10000, dtype: Float64
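One detail worth illustrating: `np.random.seed` seeds only NumPy's legacy global generator, while `np.random.default_rng()` builds an independent `Generator`. A minimal sketch, assuming reproducible draws are wanted, passes the seed to `default_rng` directly (all variable names here are illustrative):

```python
import numpy as np

# Two Generators built from the same seed produce identical draws.
sample_a = np.random.default_rng(42).normal(size=5)
sample_b = np.random.default_rng(42).normal(size=5)
same_seed_matches = bool(np.array_equal(sample_a, sample_b))

# Seeding the legacy global state does not influence default_rng():
# an unseeded Generator pulls fresh entropy from the operating system.
np.random.seed(42)
unseeded_a = np.random.default_rng().normal(size=5)
np.random.seed(42)
unseeded_b = np.random.default_rng().normal(size=5)
legacy_seed_matches = bool(np.array_equal(unseeded_a, unseeded_b))

print(same_seed_matches, legacy_seed_matches)
```

With a seeded `Generator`, the printed values of a notebook cell stay the same on every run, which makes the output easier to compare against the text.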
In [27]:
ser.plot(kind="hist")
Out[27]:
<Axes: ylabel='Frequency'>
In [28]:
ser.plot(kind="hist", bins=2)
Out[28]:
<Axes: ylabel='Frequency'>
In [29]:
ser.plot(kind="hist", bins=100)
Out[29]:
<Axes: ylabel='Frequency'>
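The effect of `bins` can also be inspected numerically. This sketch assumes a seeded sample (seed 0 is an arbitrary choice, not from the notebook) and uses `np.histogram` to compute bin edges and counts for the same bin settings tried above, plus 10 as a middle value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the seed only makes the sketch repeatable.
values = pd.Series(np.random.default_rng(0).normal(size=10_000))

summaries = {}
for bins in (2, 10, 100):
    counts, edges = np.histogram(values.to_numpy(), bins=bins)
    summaries[bins] = {
        "n_edges": len(edges),                     # always bins + 1
        "total": int(counts.sum()),                # each value lands in exactly one bin
        "bin_width": float(edges[1] - edges[0]),   # equal-width bins over the data range
    }

for bins, info in summaries.items():
    print(bins, info)
```

Too few bins hide the shape of the distribution, while too many make each bar noisy; the bin width printed here gives a rough feel for that trade-off.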
In [30]:
# Note: np.random.seed() does not affect default_rng(); the values
# shown below come from unseeded Generators and vary between runs.
np.random.seed(42)
df = pd.DataFrame({
    "normal": np.random.default_rng().normal(size=10_000),
    "triangular": np.random.default_rng().triangular(-2, 0, 2, size=10_000),
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")
df.head()
Out[30]:
| | normal | triangular |
|---|---|---|
| 0 | -1.030928 | -0.366722 |
| 1 | -1.533267 | 0.18168 |
| 2 | 0.157239 | -0.871965 |
| 3 | 0.538866 | -0.370548 |
| 4 | 0.01223 | 0.646524 |
In [31]:
df.plot(kind="hist")
Out[31]:
<Axes: ylabel='Frequency'>
In [32]:
df.plot(kind="hist", alpha=0.5)
Out[32]:
<Axes: ylabel='Frequency'>
In [33]:
df.plot(kind="hist", subplots=True)
Out[33]:
array([<Axes: ylabel='Frequency'>, <Axes: ylabel='Frequency'>],
dtype=object)
In [34]:
df.plot(kind="hist", alpha=0.5, bins=100)
Out[34]:
<Axes: ylabel='Frequency'>
In [35]:
ser.plot(kind="kde")
Out[35]:
<Axes: ylabel='Density'>
In [36]:
df.plot(kind="kde")
Out[36]:
<Axes: ylabel='Density'>
Further plot customization with Matplotlib
How to do it
In [37]:
ser = pd.Series(
    (x ** 2 for x in range(7)),
    name="book_sales",
    index=(f"Day {x + 1}" for x in range(7)),
    dtype=pd.Int64Dtype(),
)
fig, axes = plt.subplots(nrows=1, ncols=3)
ser.plot(ax=axes[0])
ser.plot(kind="bar", ax=axes[1])
ser.plot(kind="pie", ax=axes[2])
Out[37]:
<Axes: ylabel='book_sales'>
In [38]:
from matplotlib.gridspec import GridSpec
fig = plt.figure()
gs = GridSpec(2, 2, figure=fig)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[1, :])
ser.plot(ax=ax0)
ser.plot(kind="bar", ax=ax1)
ser.plot(kind="pie", ax=ax2)
Out[38]:
<Axes: ylabel='book_sales'>
In [39]:
from matplotlib.gridspec import GridSpec
fig = plt.figure()
fig.suptitle("Book Sales Visualized in Different Ways")
gs = GridSpec(2, 2, figure=fig, hspace=.5)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[1, :])
ax0 = ser.plot(ax=ax0)
ax0.set_title("Line chart")
ax1 = ser.plot(kind="bar", ax=ax1)
ax1.set_title("Bar chart")
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)
# Remove labels from chart and show in custom legend instead
ax2 = ser.plot(kind="pie", ax=ax2, labels=None)
ax2.legend(
    ser.index,
    bbox_to_anchor=(1, -0.2, 0.5, 1),  # put legend to right of chart
    prop={"size": 6},  # set font size for legend
)
ax2.set_title("Pie Chart")
ax2.set_ylabel(None)  # remove book_sales label
Out[39]:
Text(0, 0.5, '')
Scatter plots
How to do it
In [40]:
df = pd.DataFrame({
    "var_a": [1, 2, 3, 4, 5],
    "var_b": [1, 2, 4, 8, 16],
    "var_c": [500, 200, 600, 100, 400],
    "var_d": ["blue", "orange", "gray", "blue", "gray"],
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")
df
Out[40]:
| | var_a | var_b | var_c | var_d |
|---|---|---|---|---|
| 0 | 1 | 1 | 500 | blue |
| 1 | 2 | 2 | 200 | orange |
| 2 | 3 | 4 | 600 | gray |
| 3 | 4 | 8 | 100 | blue |
| 4 | 5 | 16 | 400 | gray |
In [41]:
df.plot(
    kind="scatter",
    x="var_a",
    y="var_b",
    s="var_c",
    c="var_d",
)
Out[41]:
<Axes: xlabel='var_a', ylabel='var_b'>
In [43]:
df = pd.read_csv(
    "../data/vehicles.csv.zip",
    dtype_backend="numpy_nullable",
    usecols=["city08", "highway08", "VClass", "fuelCost08", "year"],
)
df.head()
Out[43]:
| | city08 | fuelCost08 | highway08 | VClass | year |
|---|---|---|---|---|---|
| 0 | 19 | 2450 | 25 | Two Seaters | 1985 |
| 1 | 9 | 4700 | 14 | Two Seaters | 1985 |
| 2 | 23 | 1900 | 33 | Subcompact Cars | 1985 |
| 3 | 10 | 4700 | 12 | Vans | 1985 |
| 4 | 17 | 3400 | 23 | Compact Cars | 1993 |
In [47]:
car_classes = (
    "Subcompact Cars",
    "Compact Cars",
    "Midsize Cars",
    "Large Cars",
    "Two Seaters",
)
mask = (df["year"] >= 2015) & df["VClass"].isin(car_classes)
df = df[mask]
df.head()
Out[47]:
| | city08 | fuelCost08 | highway08 | VClass | year |
|---|---|---|---|---|---|
| 27058 | 16 | 3400 | 23 | Subcompact Cars | 2015 |
| 27059 | 20 | 2250 | 28 | Compact Cars | 2015 |
| 27060 | 26 | 1700 | 37 | Midsize Cars | 2015 |
| 27061 | 28 | 1600 | 39 | Midsize Cars | 2015 |
| 27062 | 25 | 1800 | 35 | Midsize Cars | 2015 |
In [49]:
df.plot(
    kind="scatter",
    x="city08",
    y="highway08",
)
Out[49]:
<Axes: xlabel='city08', ylabel='highway08'>
In [51]:
classes_ser = pd.Series(car_classes, dtype=pd.StringDtype())
cat = pd.CategoricalDtype(classes_ser)
df["VClass"] = df["VClass"].astype(cat)
df.head()
Out[51]:
| | city08 | fuelCost08 | highway08 | VClass | year |
|---|---|---|---|---|---|
| 27058 | 16 | 3400 | 23 | Subcompact Cars | 2015 |
| 27059 | 20 | 2250 | 28 | Compact Cars | 2015 |
| 27060 | 26 | 1700 | 37 | Midsize Cars | 2015 |
| 27061 | 28 | 1600 | 39 | Midsize Cars | 2015 |
| 27062 | 25 | 1800 | 35 | Midsize Cars | 2015 |
In [53]:
df.plot(
    kind="scatter",
    x="city08",
    y="highway08",
    c="VClass",
)
Out[53]:
<Axes: xlabel='city08', ylabel='highway08'>
In [55]:
df.plot(
    kind="scatter",
    x="city08",
    y="highway08",
    c="VClass",
    colormap="Dark2",
)
Out[55]:
<Axes: xlabel='city08', ylabel='highway08'>
In [57]:
df.plot(
    kind="scatter",
    x="city08",
    y="highway08",
    c="VClass",
    colormap="Dark2",
    s="fuelCost08",
)
Out[57]:
<Axes: xlabel='city08', ylabel='highway08'>
In [59]:
df.assign(
    scaled_fuel_cost=lambda x: x["fuelCost08"] / 25,
).plot(
    kind="scatter",
    x="city08",
    y="highway08",
    c="VClass",
    colormap="Dark2",
    s="scaled_fuel_cost",
    alpha=0.4,
)
Out[59]:
<Axes: xlabel='city08', ylabel='highway08'>
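Dividing `fuelCost08` by 25 is one way to bring marker sizes into a readable range. As an alternative sketch (not the notebook's method), a min-max rescale maps any numeric column onto a chosen size interval. The helper `scale_marker_sizes`, its default bounds, and the sample values are hypothetical; the scaled result could then be passed as `s=`, which Matplotlib interprets as marker area in points squared.

```python
import pandas as pd


def scale_marker_sizes(
    values: pd.Series, min_size: float = 10.0, max_size: float = 200.0
) -> pd.Series:
    """Linearly map values onto the interval [min_size, max_size]."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values identical: use the midpoint so markers stay visible.
        return pd.Series((min_size + max_size) / 2, index=values.index)
    return min_size + (values - lo) / (hi - lo) * (max_size - min_size)


# Hypothetical fuel costs, loosely shaped like the fuelCost08 column.
fuel_cost = pd.Series([1600, 1700, 1800, 2250, 3400])
sizes = scale_marker_sizes(fuel_cost)
print(sizes.tolist())
```

Compared with a fixed divisor, this keeps the smallest and largest markers at known sizes even if the data range changes.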
There's more…
In [62]:
from pandas.plotting import scatter_matrix
scatter_matrix(df);
Exploring categorical data
How to do it
In [66]:
df = pd.read_csv("../data/vehicles.csv.zip",
                 dtype_backend="numpy_nullable")
df.head()
C:\Users\getwa\AppData\Local\Temp\ipykernel_8564\1310465292.py:1: DtypeWarning: Columns (72,74,75,77) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("../data/vehicles.csv.zip",
Out[66]:
| | barrels08 | barrelsA08 | charge120 | charge240 | city08 | city08U | cityA08 | cityA08U | cityCD | cityE | ... | mfrCode | c240Dscr | charge240b | c240bDscr | createdOn | modifiedOn | startStop | phevCity | phevHwy | phevComb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.167143 | 0.0 | 0.0 | 0.0 | 19 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 1 | 27.046364 | 0.0 | 0.0 | 0.0 | 9 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 2 | 11.018889 | 0.0 | 0.0 | 0.0 | 23 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 3 | 27.046364 | 0.0 | 0.0 | 0.0 | 10 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 4 | 15.658421 | 0.0 | 0.0 | 0.0 | 17 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
5 rows × 84 columns
In [68]:
df.iloc[:, [72, 74, 75, 77]]
Out[68]:
| | rangeA | mfrCode | c240Dscr | c240bDscr |
|---|---|---|---|---|
| 0 | <NA> | <NA> | <NA> | <NA> |
| 1 | <NA> | <NA> | <NA> | <NA> |
| 2 | <NA> | <NA> | <NA> | <NA> |
| 3 | <NA> | <NA> | <NA> | <NA> |
| 4 | <NA> | <NA> | <NA> | <NA> |
| ... | ... | ... | ... | ... |
| 47518 | <NA> | <NA> | <NA> | <NA> |
| 47519 | <NA> | <NA> | <NA> | <NA> |
| 47520 | <NA> | <NA> | <NA> | <NA> |
| 47521 | <NA> | <NA> | <NA> | <NA> |
| 47522 | <NA> | <NA> | <NA> | <NA> |
47523 rows × 4 columns
In [70]:
df["rangeA"].value_counts()
Out[70]:
rangeA
290 74
270 58
280 56
310 41
277 38
..
45 1
36 1
42 1
327 1
166 1
Name: count, Length: 264, dtype: int64
In [72]:
df["rangeA"].str.isnumeric().idxmax()
Out[72]:
7116
In [74]:
df.iloc[:, [74, 75, 77]].pipe(pd.isna).idxmin()
Out[74]:
mfrCode      23147
c240Dscr     25661
c240bDscr    25661
dtype: int64
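The two cells above rely on a small idiom: on a boolean Series, `idxmax()` returns the label of the first `True` and `idxmin()` the label of the first `False`. The sketch below shows this on a tiny made-up frame (column names and values are hypothetical). It also guards against the case where no `True` exists, because `idxmax()` would then still return the first label.

```python
import pandas as pd

# Hypothetical miniature of the mixed-content columns.
frame = pd.DataFrame({
    "range_text": pd.Series([pd.NA, pd.NA, "290", "240/290"], dtype=pd.StringDtype()),
    "code": pd.Series([pd.NA, "FMX", pd.NA, "GMX"], dtype=pd.StringDtype()),
})

# Missing entries yield <NA> from str.isnumeric(); treat them as "not numeric".
is_numeric = frame["range_text"].str.isnumeric().fillna(False).astype(bool)
first_numeric = is_numeric.idxmax() if is_numeric.any() else None

# pd.isna gives True for missing cells, so idxmin finds the first present value per column.
first_present = frame.pipe(pd.isna).idxmin()

print(first_numeric)
print(first_present)
```

Knowing the first row that holds a real value makes it easy to inspect what kind of content caused the mixed-type warning.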
In [78]:
df = pd.read_csv(
    "../data/vehicles.csv.zip",
    dtype_backend="numpy_nullable",
    dtype={
        "rangeA": pd.StringDtype(),
        "mfrCode": pd.StringDtype(),
        "c240Dscr": pd.StringDtype(),
        "c240bDscr": pd.StringDtype()
    }
)
df.head()
Out[78]:
| | barrels08 | barrelsA08 | charge120 | charge240 | city08 | city08U | cityA08 | cityA08U | cityCD | cityE | ... | mfrCode | c240Dscr | charge240b | c240bDscr | createdOn | modifiedOn | startStop | phevCity | phevHwy | phevComb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.167143 | 0.0 | 0.0 | 0.0 | 19 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 1 | 27.046364 | 0.0 | 0.0 | 0.0 | 9 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 2 | 11.018889 | 0.0 | 0.0 | 0.0 | 23 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 3 | 27.046364 | 0.0 | 0.0 | 0.0 | 10 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 4 | 15.658421 | 0.0 | 0.0 | 0.0 | 17 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
5 rows × 84 columns
In [80]:
df.select_dtypes(include=["string"]).columns
Out[80]:
Index(['drive', 'eng_dscr', 'fuelType', 'fuelType1', 'make', 'model',
'mpgData', 'trany', 'VClass', 'baseModel', 'guzzler', 'trans_dscr',
'tCharger', 'sCharger', 'atvType', 'fuelType2', 'rangeA', 'evMotor',
'mfrCode', 'c240Dscr', 'c240bDscr', 'createdOn', 'modifiedOn',
'startStop'],
dtype='object')
In [82]:
df.select_dtypes(include=["string"]).nunique().sort_values()
Out[82]:
tCharger         1
sCharger         1
mpgData          2
startStop        2
guzzler          3
fuelType2        4
c240Dscr         5
drive            7
fuelType1        7
c240bDscr        7
atvType          9
fuelType        15
VClass          34
trany           40
trans_dscr      52
mfrCode         56
make           144
rangeA         245
modifiedOn     298
evMotor        400
createdOn      455
eng_dscr       608
baseModel     1451
model         5064
dtype: int64
In [84]:
low_card = df.select_dtypes(include=["string"]).nunique().sort_values().iloc[:9].index
fig, axes = plt.subplots(nrows=3, ncols=3)
for index, column in enumerate(low_card):
    row, col = divmod(index, 3)
    ax = axes[row][col]
    df[column].value_counts().plot(kind="bar", ax=ax)
In [86]:
low_card = df.select_dtypes(include=["string"]).nunique().sort_values().iloc[:9].index
fig, axes = plt.subplots(nrows=3, ncols=3)
for index, column in enumerate(low_card):
    row = index % 3
    col = index // 3
    ax = axes[row][col]
    counts = df[column].value_counts()
    counts.set_axis(counts.index.str[:8]).plot(kind="bar", ax=ax)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize=6)
plt.tight_layout()
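The two grid-filling cells above differ only in how a flat index is turned into a (row, column) position. A small pure-Python sketch makes the difference visible: `divmod(index, 3)` fills the 3×3 grid row by row, while `index % 3` and `index // 3` fill it column by column. The names `row_major` and `col_major` are illustrative.

```python
n_rows, n_cols = 3, 3

# divmod(i, n_cols) -> (i // n_cols, i % n_cols): left to right, then down.
row_major = [divmod(i, n_cols) for i in range(n_rows * n_cols)]

# (i % n_rows, i // n_rows): top to bottom, then right.
col_major = [(i % n_rows, i // n_rows) for i in range(n_rows * n_cols)]

print(row_major[:4])  # [(0, 0), (0, 1), (0, 2), (1, 0)]
print(col_major[:4])  # [(0, 0), (1, 0), (2, 0), (0, 1)]
```

Both orderings visit every cell exactly once, so the choice only affects where each chart ends up in the figure.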
Exploring continuous data
How to do it
In [90]:
df = pd.read_csv(
    "../data/vehicles.csv.zip",
    dtype_backend="numpy_nullable",
    dtype={
        "rangeA": pd.StringDtype(),
        "mfrCode": pd.StringDtype(),
        "c240Dscr": pd.StringDtype(),
        "c240bDscr": pd.StringDtype()
    }
)
df.head()
Out[90]:
| | barrels08 | barrelsA08 | charge120 | charge240 | city08 | city08U | cityA08 | cityA08U | cityCD | cityE | ... | mfrCode | c240Dscr | charge240b | c240bDscr | createdOn | modifiedOn | startStop | phevCity | phevHwy | phevComb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.167143 | 0.0 | 0.0 | 0.0 | 19 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 1 | 27.046364 | 0.0 | 0.0 | 0.0 | 9 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 2 | 11.018889 | 0.0 | 0.0 | 0.0 | 23 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 3 | 27.046364 | 0.0 | 0.0 | 0.0 | 10 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
| 4 | 15.658421 | 0.0 | 0.0 | 0.0 | 17 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | <NA> | <NA> | 0.0 | <NA> | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | <NA> | 0 | 0 | 0 |
5 rows × 84 columns
In [92]:
df.select_dtypes(exclude=["string"]).columns
Out[92]:
Index(['barrels08', 'barrelsA08', 'charge120', 'charge240', 'city08',
'city08U', 'cityA08', 'cityA08U', 'cityCD', 'cityE', 'cityUF', 'co2',
'co2A', 'co2TailpipeAGpm', 'co2TailpipeGpm', 'comb08', 'comb08U',
'combA08', 'combA08U', 'combE', 'combinedCD', 'combinedUF', 'cylinders',
'displ', 'engId', 'feScore', 'fuelCost08', 'fuelCostA08', 'ghgScore',
'ghgScoreA', 'highway08', 'highway08U', 'highwayA08', 'highwayA08U',
'highwayCD', 'highwayE', 'highwayUF', 'hlv', 'hpv', 'id', 'lv2', 'lv4',
'phevBlended', 'pv2', 'pv4', 'range', 'rangeCity', 'rangeCityA',
'rangeHwy', 'rangeHwyA', 'UCity', 'UCityA', 'UHighway', 'UHighwayA',
'year', 'youSaveSpend', 'charge240b', 'phevCity', 'phevHwy',
'phevComb'],
dtype='object')
In [94]:
df.select_dtypes(
    exclude=["string"]
).pipe(pd.isna).sum().sort_values(ascending=False).head()
Out[94]:
cylinders     801
displ         799
barrels08       0
barrelsA08      0
city08          0
dtype: int64
In [96]:
df.loc[df["cylinders"].isna(), ["make", "model"]].value_counts()
Out[96]:
make model
Fiat 500e 8
BYD e6 7
Ford Focus Electric 7
Chevrolet Bolt EV 7
smart fortwo electric drive coupe 7
..
Lucid Air Dream R AWD w/21 inch wheels 1
Air Dream R AWD w/19 inch wheels 1
Audi Q4 40 e-tron 1
Vinfast VF 9 Plus 1
VF 9 Eco 1
Name: count, Length: 450, dtype: int64
In [98]:
df["cylinders"] = df["cylinders"].fillna(0)
In [100]:
df.loc[df["displ"].isna(), ["make", "model"]].value_counts()
Out[100]:
make model
Fiat 500e 8
Ford Focus Electric 7
Toyota RAV4 EV 7
smart fortwo electric drive coupe 7
Nissan Leaf 7
..
Lexus RZ 450e AWD (20 inch Wheels) 1
RZ 450e AWD (20 inch wheels) 1
Vinfast VF 9 Plus 1
Azure Dynamics Transit Connect Electric Van/Wagon 1
BMW Active E 1
Name: count, Length: 449, dtype: int64
In [102]:
df["displ"].nunique()
Out[102]:
66
In [104]:
df["city08"].plot(kind="hist")
Out[104]:
<Axes: ylabel='Frequency'>
In [106]:
df["city08"].plot(kind="hist", bins=30)
Out[106]:
<Axes: ylabel='Frequency'>
In [108]:
fig, axes = plt.subplots(nrows=2, ncols=1)
axes[0].set_xlim(0, 40)
axes[1].set_xlim(0, 40)
df["city08"].plot(kind="kde", ax=axes[0])
df["highway08"].plot(kind="kde", ax=axes[1])
axes[0].set_ylabel("city")
axes[1].set_ylabel("highway")
Out[108]:
Text(0, 0.5, 'highway')
Using seaborn for more advanced plots
In [71]:
import seaborn as sns
sns.set_theme()
sns.set_style("white")
How to do it
In [72]:
df = pd.DataFrame([
    ["Q1-2024", "project_a", 1],
    ["Q1-2024", "project_b", 1],
    ["Q2-2024", "project_a", 2],
    ["Q2-2024", "project_b", 2],
    ["Q3-2024", "project_a", 4],
    ["Q3-2024", "project_b", 3],
    ["Q4-2024", "project_a", 8],
    ["Q4-2024", "project_b", 4],
    ["Q1-2025", "project_a", 16],
    ["Q1-2025", "project_b", 5],
], columns=["quarter", "project", "github_stars"])
df = df.convert_dtypes(dtype_backend="numpy_nullable")
df
Out[72]:
| | quarter | project | github_stars |
|---|---|---|---|
| 0 | Q1-2024 | project_a | 1 |
| 1 | Q1-2024 | project_b | 1 |
| 2 | Q2-2024 | project_a | 2 |
| 3 | Q2-2024 | project_b | 2 |
| 4 | Q3-2024 | project_a | 4 |
| 5 | Q3-2024 | project_b | 3 |
| 6 | Q4-2024 | project_a | 8 |
| 7 | Q4-2024 | project_b | 4 |
| 8 | Q1-2025 | project_a | 16 |
| 9 | Q1-2025 | project_b | 5 |
In [73]:
sns.barplot(df, x="quarter", y="github_stars", hue="project")
Out[73]:
<Axes: xlabel='quarter', ylabel='github_stars'>
In [74]:
sns.lineplot(df, x="quarter", y="github_stars", hue="project")
Out[74]:
<Axes: xlabel='quarter', ylabel='github_stars'>
In [75]:
df
Out[75]:
| | quarter | project | github_stars |
|---|---|---|---|
| 0 | Q1-2024 | project_a | 1 |
| 1 | Q1-2024 | project_b | 1 |
| 2 | Q2-2024 | project_a | 2 |
| 3 | Q2-2024 | project_b | 2 |
| 4 | Q3-2024 | project_a | 4 |
| 5 | Q3-2024 | project_b | 3 |
| 6 | Q4-2024 | project_a | 8 |
| 7 | Q4-2024 | project_b | 4 |
| 8 | Q1-2025 | project_a | 16 |
| 9 | Q1-2025 | project_b | 5 |
In [76]:
df = pd.DataFrame({
    "project_a": [1, 2, 4, 8, 16],
    "project_b": [1, 2, 3, 4, 5],
}, index=["Q1-2024", "Q2-2024", "Q3-2024", "Q4-2024", "Q1-2025"])
df = df.convert_dtypes(dtype_backend="numpy_nullable")
df
Out[76]:
| | project_a | project_b |
|---|---|---|
| Q1-2024 | 1 | 1 |
| Q2-2024 | 2 | 2 |
| Q3-2024 | 4 | 3 |
| Q4-2024 | 8 | 4 |
| Q1-2025 | 16 | 5 |
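The wide table above holds the same numbers as the earlier long-form table with `quarter`, `project`, and `github_stars` columns. As a hedged sketch of how one layout could be converted into the other, `DataFrame.melt` can reshape wide data into the long form used with the `x=`, `y=`, and `hue=` arguments earlier. The variable names `wide` and `long_form` are illustrative.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "project_a": [1, 2, 4, 8, 16],
        "project_b": [1, 2, 3, 4, 5],
    },
    index=["Q1-2024", "Q2-2024", "Q3-2024", "Q4-2024", "Q1-2025"],
)

long_form = (
    wide.rename_axis("quarter")   # name the index so it becomes a proper column
    .reset_index()
    .melt(id_vars="quarter", var_name="project", value_name="github_stars")
)
print(long_form)
```

Note that `melt` stacks the columns one after another, so the rows come out grouped by project instead of by quarter; the values are otherwise the same.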
In [110]:
df = pd.read_csv(
    "../data/movie.csv",
    usecols=["movie_title", "title_year", "imdb_score", "content_rating"],
    dtype_backend="numpy_nullable",
)
df.head()
Out[110]:
| | movie_title | content_rating | title_year | imdb_score |
|---|---|---|---|---|
| 0 | Avatar | PG-13 | 2009.0 | 7.9 |
| 1 | Pirates of the Caribbean: At World's End | PG-13 | 2007.0 | 7.1 |
| 2 | Spectre | PG-13 | 2015.0 | 6.8 |
| 3 | The Dark Knight Rises | PG-13 | 2012.0 | 8.5 |
| 4 | Star Wars: Episode VII - The Force Awakens | <NA> | <NA> | 7.1 |
In [114]:
df = pd.read_csv(
    "../data/movie.csv",
    usecols=["movie_title", "title_year", "imdb_score", "content_rating"],
    dtype_backend="numpy_nullable",
    dtype={"title_year": pd.Int16Dtype()},
)
df.head()
Out[114]:
| | movie_title | content_rating | title_year | imdb_score |
|---|---|---|---|---|
| 0 | Avatar | PG-13 | 2009 | 7.9 |
| 1 | Pirates of the Caribbean: At World's End | PG-13 | 2007 | 7.1 |
| 2 | Spectre | PG-13 | 2015 | 6.8 |
| 3 | The Dark Knight Rises | PG-13 | 2012 | 8.5 |
| 4 | Star Wars: Episode VII - The Force Awakens | <NA> | <NA> | 7.1 |
In [79]:
df["title_year"].min()
Out[79]:
1916
In [80]:
df["title_year"].max()
Out[80]:
2016
In [81]:
df = df.assign(
    title_decade=lambda x: pd.cut(x["title_year"],
                                  bins=range(1910, 2021, 10)))
df.head()
Out[81]:
| | movie_title | content_rating | title_year | imdb_score | title_decade |
|---|---|---|---|---|---|
| 0 | Avatar | PG-13 | 2009 | 7.9 | (2000.0, 2010.0] |
| 1 | Pirates of the Caribbean: At World's End | PG-13 | 2007 | 7.1 | (2000.0, 2010.0] |
| 2 | Spectre | PG-13 | 2015 | 6.8 | (2010.0, 2020.0] |
| 3 | The Dark Knight Rises | PG-13 | 2012 | 8.5 | (2010.0, 2020.0] |
| 4 | Star Wars: Episode VII - The Force Awakens | <NA> | <NA> | 7.1 | NaN |
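With its defaults, `pd.cut` builds right-closed intervals, so a year such as 2010 falls into `(2000, 2010]`. If calendar decades are preferred, one possible variation (an assumption, not what the notebook does) is to pass `right=False` together with readable labels. The sample years and the `"1910s"`-style labels below are hypothetical.

```python
import pandas as pd

years = pd.Series([1916, 1920, 2009, 2010, 2015])
edges = list(range(1910, 2021, 10))  # 1910, 1920, ..., 2020

# Default behaviour: intervals are open on the left and closed on the right.
default_bins = pd.cut(years, bins=edges)

# Variation: left-closed bins with one label per decade.
labels = [f"{start}s" for start in edges[:-1]]
decades = pd.cut(years, bins=edges, right=False, labels=labels)

print(default_bins.tolist())
print(decades.tolist())
```

With `right=False` the final edge (2020 here) is excluded, so the edge list would need extending if the data contained that year.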
In [82]:
sns.boxplot(
    data=df,
    x="imdb_score",
    y="title_decade",
)
Out[82]:
<Axes: xlabel='imdb_score', ylabel='title_decade'>
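A box plot summarizes each group by its quartiles. To see the kind of numbers behind the boxes, the sketch below computes the 25th, 50th, and 75th percentiles per group with `groupby(...).quantile(...)`. The `scores` frame is made-up toy data, not the movie dataset, and seaborn's own quartile computation may differ in interpolation details.

```python
import pandas as pd

# Hypothetical scores for two decades.
scores = pd.DataFrame({
    "decade": ["1990s"] * 5 + ["2000s"] * 5,
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0, 6.0, 6.5, 7.0, 7.5, 8.0],
})

summary = (
    scores.groupby("decade")["imdb_score"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()                      # one row per decade, one column per quantile
)
summary["iqr"] = summary[0.75] - summary[0.25]  # the height of the box
print(summary)
```

The interquartile range gives a quick numeric check of which groups have the widest spread.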
In [83]:
sns.violinplot(
    data=df,
    x="imdb_score",
    y="title_decade",
)
Out[83]:
<Axes: xlabel='imdb_score', ylabel='title_decade'>
In [84]:
sns.swarmplot(
    data=df,
    x="imdb_score",
    y="title_decade",
    size=.25,
)
Out[84]:
<Axes: xlabel='imdb_score', ylabel='title_decade'>
In [85]:
ratings_of_interest = {"G", "PG", "PG-13", "R"}
mask = (
    (df["title_year"] >= 2013)
    & (df["title_year"] <= 2015)
    & (df["content_rating"].isin(ratings_of_interest))
)
data = df[mask].assign(
    title_year=lambda x: x["title_year"].astype(pd.CategoricalDtype())
)
data.head()
Out[85]:
| | movie_title | content_rating | title_year | imdb_score | title_decade |
|---|---|---|---|---|---|
| 2 | Spectre | PG-13 | 2015 | 6.8 | (2010, 2020] |
| 8 | Avengers: Age of Ultron | PG-13 | 2015 | 7.5 | (2010, 2020] |
| 14 | The Lone Ranger | PG-13 | 2013 | 6.5 | (2010, 2020] |
| 15 | Man of Steel | PG-13 | 2013 | 7.2 | (2010, 2020] |
| 20 | The Hobbit: The Battle of the Five Armies | PG-13 | 2014 | 7.5 | (2010, 2020] |
In [86]:
sns.swarmplot(
    data=data,
    x="imdb_score",
    y="title_year",
    hue="content_rating",
)
Out[86]:
<Axes: xlabel='imdb_score', ylabel='title_year'>
In [87]:
sns.catplot(
    kind="swarm",
    data=data,
    x="imdb_score",
    y="title_year",
    col="content_rating",
    col_wrap=2,
)
Out[87]:
<seaborn.axisgrid.FacetGrid at 0x7d6980a39dc0>