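Visualization with pandas, Matplotlib, and seaborn

Although the core of pandas is array-based data manipulation, the library also ships with plotting functionality that can draw charts directly from your data when the requirements are not too complex.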

In [4]:

import pandas as pd
import numpy as np
import pyarrow as pa

Introduction

In [6]:

import matplotlib.pyplot as plt
plt.ion();

Creating charts from aggregated data

How to do it

In [8]:

ser = pd.Series(
    (x ** 2 for x in range(7)),
    name="book_sales",
    index=(f"Day {x + 1}" for x in range(7)),
    dtype=pd.Int64Dtype(),
)

ser

Out[8]:

Day 1     0
Day 2     1
Day 3     4
Day 4     9
Day 5    16
Day 6    25
Day 7    36
Name: book_sales, dtype: Int64

In [10]:

ser.plot();

In [12]:

ser.plot(kind="bar");

In [14]:

ser.plot(kind="barh");

In [16]:

ser.plot(kind="area");

In [20]:

ser.plot(kind="pie");

In [24]:

df = pd.DataFrame({
    "book_sales": (x ** 2 for x in range(7)),
    "book_returns": [3, 2, 1, 0, 1, 2, 3],
}, index=(f"Day {x + 1}" for x in range(7)))
df = df.convert_dtypes(dtype_backend="numpy_nullable")

df

Out[24]:

	book_sales	book_returns
Day 1	0	3
Day 2	1	2
Day 3	4	1
Day 4	9	0
Day 5	16	1
Day 6	25	2
Day 7	36	3

In [26]:

df.plot();

In [28]:

df.plot(kind="bar");

In [12]:

df.plot(kind="bar", stacked=True)

Out[12]:

<Axes: >

In [30]:

df.plot(kind="barh");

In [32]:

df.plot(kind="barh", stacked=True);

In [15]:

df.plot(kind="area")

Out[15]:

<Axes: >

In [16]:

df.plot(kind="area", stacked=False, alpha=0.5)

Out[16]:

<Axes: >

Plotting distributions of non-aggregated data

How to do it

In [26]:

# Note: np.random.seed() seeds only the legacy global RNG and has no effect on
# default_rng(), so these draws are unseeded and change on every run. Pass a
# seed to default_rng() itself if you need reproducible values.
ser = pd.Series(
    np.random.default_rng().normal(size=10_000),
    dtype=pd.Float64Dtype(),
)

ser

Out[26]:

0      -1.136009
1      -0.845098
2        1.85341
3       0.012477
4      -0.075157
          ...   
9995   -0.324093
9996   -0.676388
9997    0.158774
9998    1.674571
9999    0.549696
Length: 10000, dtype: Float64
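
np.random.seed() seeds only NumPy's legacy global generator; a Generator created by default_rng() is not affected by it. A minimal sketch of a reproducible variant, assuming you want identical draws on every run, passes the seed to default_rng() directly. The seed value 42 and the names rng_a, rng_b, ser_a, and ser_b are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Passing a seed to default_rng() is what makes the draws reproducible.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

ser_a = pd.Series(rng_a.normal(size=10_000), dtype=pd.Float64Dtype())
ser_b = pd.Series(rng_b.normal(size=10_000), dtype=pd.Float64Dtype())

# Two generators built from the same seed yield identical samples.
same = ser_a.equals(ser_b)
print(same)
```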

In [27]:

ser.plot(kind="hist")

Out[27]:

<Axes: ylabel='Frequency'>

In [28]:

ser.plot(kind="hist", bins=2)

Out[28]:

<Axes: ylabel='Frequency'>

In [29]:

ser.plot(kind="hist", bins=100)

Out[29]:

<Axes: ylabel='Frequency'>
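
The bins argument sets how many equal-width intervals the value range is divided into (pandas uses 10 when it is omitted). As a rough sketch of the counting behind such a chart, the following uses np.histogram as a stand-in for what the plot draws. The seed and the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
values = rng.normal(size=10_000)

for bins in (2, 10, 100):
    counts, edges = np.histogram(values, bins=bins)
    # N bins need N + 1 edges, and every value lands in exactly one bin,
    # so the counts always add up to the sample size.
    print(bins, len(edges), counts.sum())
```

Too few bins hide the shape of the distribution, while too many make the chart noisy, so it is usually worth trying several values.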

In [30]:

# Note: np.random.seed() has no effect on default_rng(), so these draws are
# unseeded and the values shown below will differ between runs.
rng = np.random.default_rng()
df = pd.DataFrame({
    "normal": rng.normal(size=10_000),
    "triangular": rng.triangular(-2, 0, 2, size=10_000),
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")

df.head()

Out[30]:

	normal	triangular
0	-1.030928	-0.366722
1	-1.533267	0.18168
2	0.157239	-0.871965
3	0.538866	-0.370548
4	0.01223	0.646524

In [31]:

df.plot(kind="hist")

Out[31]:

<Axes: ylabel='Frequency'>

In [32]:

df.plot(kind="hist", alpha=0.5)

Out[32]:

<Axes: ylabel='Frequency'>

In [33]:

df.plot(kind="hist", subplots=True)

Out[33]:

array([<Axes: ylabel='Frequency'>, <Axes: ylabel='Frequency'>],
      dtype=object)

In [34]:

df.plot(kind="hist", alpha=0.5, bins=100)

Out[34]:

<Axes: ylabel='Frequency'>

In [35]:

ser.plot(kind="kde")

Out[35]:

<Axes: ylabel='Density'>

In [36]:

df.plot(kind="kde")

Out[36]:

<Axes: ylabel='Density'>
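
A kernel density estimate replaces the bars of a histogram with a smooth curve by placing a small bump on every observation and averaging the bumps. The NumPy-only sketch below illustrates the idea, assuming a Gaussian kernel and Scott's rule-of-thumb bandwidth. It is an illustration of the technique rather than the code pandas runs, and the seed, grid, and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
sample = rng.normal(size=500)

# Scott's rule of thumb for a one-dimensional bandwidth.
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)

grid = np.linspace(-5, 5, 1001)
# Standardized distance from every grid point to every observation.
z = (grid[:, None] - sample[None, :]) / bandwidth
# Average of one Gaussian bump per observation.
density = np.exp(-0.5 * z**2).sum(axis=1) / (
    len(sample) * bandwidth * np.sqrt(2 * np.pi)
)

# A density integrates to roughly one over a wide enough grid.
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```

A smaller bandwidth follows the data more closely and produces a bumpier curve, while a larger one smooths out detail.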

Further chart customization with Matplotlib

How to do it

In [37]:

ser = pd.Series(
    (x ** 2 for x in range(7)),
    name="book_sales",
    index=(f"Day {x + 1}" for x in range(7)),
    dtype=pd.Int64Dtype(),
)
fig, axes = plt.subplots(nrows=1, ncols=3)
ser.plot(ax=axes[0])
ser.plot(kind="bar", ax=axes[1])
ser.plot(kind="pie", ax=axes[2])

Out[37]:

<Axes: ylabel='book_sales'>

In [38]:

from matplotlib.gridspec import GridSpec

fig = plt.figure()
gs = GridSpec(2, 2, figure=fig)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[1, :])
ser.plot(ax=ax0)
ser.plot(kind="bar", ax=ax1)
ser.plot(kind="pie", ax=ax2)

Out[38]:

<Axes: ylabel='book_sales'>

In [39]:

from matplotlib.gridspec import GridSpec

fig = plt.figure()
fig.suptitle("Book Sales Visualized in Different Ways")
gs = GridSpec(2, 2, figure=fig, hspace=.5)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[1, :])
ax0 = ser.plot(ax=ax0)
ax0.set_title("Line chart")

ax1 = ser.plot(kind="bar", ax=ax1)
ax1.set_title("Bar chart")
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)

# Remove labels from chart and show in custom legend instead
ax2 = ser.plot(kind="pie", ax=ax2, labels=None)
ax2.legend(
    ser.index,
    bbox_to_anchor=(1, -0.2, 0.5, 1),  # put legend to right of chart
    prop={"size": 6}, # set font size for legend
)
ax2.set_title("Pie Chart")
ax2.set_ylabel(None)  # remove book_sales label

Out[39]:

Text(0, 0.5, '')
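
The GridSpec layout above can also be described by name. The sketch below uses Matplotlib's subplot_mosaic as an alternative, not as the approach the notebook takes. The panel names "line", "bar", and "pie" are made up for this example, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

ser = pd.Series(
    [x ** 2 for x in range(7)],
    name="book_sales",
    index=[f"Day {x + 1}" for x in range(7)],
)

# Each string names a panel; repeating "pie" makes it span the bottom row.
fig, axes = plt.subplot_mosaic([["line", "bar"], ["pie", "pie"]])
ser.plot(ax=axes["line"], title="Line chart")
ser.plot(kind="bar", ax=axes["bar"], title="Bar chart")
ser.plot(kind="pie", ax=axes["pie"], title="Pie chart")
fig.tight_layout()
```

Looking axes up by name can be easier to read than slicing a grid once a figure has more than a handful of panels.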

Scatter plots

How to do it

In [40]:

df = pd.DataFrame({
    "var_a": [1, 2, 3, 4, 5],
    "var_b": [1, 2, 4, 8, 16],
    "var_c": [500, 200, 600, 100, 400],
    "var_d": ["blue", "orange", "gray", "blue", "gray"],
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")

df

Out[40]:

	var_a	var_b	var_c	var_d
0	1	1	500	blue
1	2	2	200	orange
2	3	4	600	gray
3	4	8	100	blue
4	5	16	400	gray

In [41]:

df.plot(
    kind="scatter",
    x="var_a",
    y="var_b",
    s="var_c",
    c="var_d",
)

Out[41]:

<Axes: xlabel='var_a', ylabel='var_b'>

In [43]:

df = pd.read_csv(
    "../data/vehicles.csv.zip",
    dtype_backend="numpy_nullable",
    usecols=["city08", "highway08", "VClass", "fuelCost08", "year"],
)
df.head()

Out[43]:

	city08	fuelCost08	highway08	VClass	year
0	19	2450	25	Two Seaters	1985
1	9	4700	14	Two Seaters	1985
2	23	1900	33	Subcompact Cars	1985
3	10	4700	12	Vans	1985
4	17	3400	23	Compact Cars	1993

In [47]:

car_classes = (
    "Subcompact Cars",
    "Compact Cars",
    "Midsize Cars",
    "Large Cars",
    "Two Seaters",
)
mask = (df["year"] >= 2015) & df["VClass"].isin(car_classes)
df = df[mask]
df.head()

Out[47]:

	city08	fuelCost08	highway08	VClass	year
27058	16	3400	23	Subcompact Cars	2015
27059	20	2250	28	Compact Cars	2015
27060	26	1700	37	Midsize Cars	2015
27061	28	1600	39	Midsize Cars	2015
27062	25	1800	35	Midsize Cars	2015

In [49]:

df.plot(
    kind="scatter",
    x="city08",
    y="highway08",
)

Out[49]:

<Axes: xlabel='city08', ylabel='highway08'>

In [51]:

classes_ser = pd.Series(car_classes, dtype=pd.StringDtype())
cat = pd.CategoricalDtype(classes_ser)
df["VClass"] = df["VClass"].astype(cat)
df.head()

Out[51]:

	city08	fuelCost08	highway08	VClass	year
27058	16	3400	23	Subcompact Cars	2015
27059	20	2250	28	Compact Cars	2015
27060	26	1700	37	Midsize Cars	2015
27061	28	1600	39	Midsize Cars	2015
27062	25	1800	35	Midsize Cars	2015

In [53]:

df.plot(
    kind="scatter",
    x="city08",
    y="highway08",
    c="VClass",
)

Out[53]:

<Axes: xlabel='city08', ylabel='highway08'>

In [55]:

df.plot(
    kind="scatter",
    x="city08",
    y="highway08",
    c="VClass",
    colormap="Dark2",
)

Out[55]:

<Axes: xlabel='city08', ylabel='highway08'>

In [57]:

df.plot(
    kind="scatter",
    x="city08",
    y="highway08",
    c="VClass",
    colormap="Dark2",
    s="fuelCost08",
)

Out[57]:

<Axes: xlabel='city08', ylabel='highway08'>

In [59]:

df.assign(
    scaled_fuel_cost=lambda x: x["fuelCost08"] / 25,
).plot(
    kind="scatter",
    x="city08",
    y="highway08",
    c="VClass",
    colormap="Dark2",
    s="scaled_fuel_cost",
    alpha=0.4,
)

Out[59]:

<Axes: xlabel='city08', ylabel='highway08'>
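
Matplotlib interprets the s argument as a marker area in points squared, which is why raw fuel costs in the thousands produce oversized bubbles and the cell above divides them by 25. Another option, sketched below, is to rescale the column into a fixed size range. The fuel-cost values, the 10 to 200 range, and the marker_size column name are hypothetical choices for this illustration.

```python
import pandas as pd

# Hypothetical fuel costs, used only to illustrate the scaling.
df = pd.DataFrame({"fuelCost08": [1600, 1700, 2250, 3400]})

min_size, max_size = 10, 200  # assumed marker-area range in points**2
cost = df["fuelCost08"]

# Min-max scaling: the cheapest car maps to min_size, the priciest to max_size.
df = df.assign(
    marker_size=min_size
    + (cost - cost.min()) / (cost.max() - cost.min()) * (max_size - min_size)
)
print(df)
```

The resulting column can then be passed by name as s="marker_size" in the same way as scaled_fuel_cost.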

Exploring categorical data

How to do it

In [66]:

df = pd.read_csv("../data/vehicles.csv.zip",
    dtype_backend="numpy_nullable")
df.head()

C:\Users\getwa\AppData\Local\Temp\ipykernel_8564\1310465292.py:1: DtypeWarning: Columns (72,74,75,77) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("../data/vehicles.csv.zip",

Out[66]:

	barrels08	city08	...	mfrCode	c240Dscr	c240bDscr	createdOn	modifiedOn	startStop
0	14.167143	19	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
1	27.046364	9	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
2	11.018889	23	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
3	27.046364	10	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
4	15.658421	17	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>

5 rows × 84 columns

In [68]:

df.iloc[:, [72, 74, 75, 77]]

Out[68]:

	rangeA	mfrCode	c240Dscr	c240bDscr
0	<NA>	<NA>	<NA>	<NA>
1	<NA>	<NA>	<NA>	<NA>
2	<NA>	<NA>	<NA>	<NA>
3	<NA>	<NA>	<NA>	<NA>
4	<NA>	<NA>	<NA>	<NA>
...	...	...	...	...
47518	<NA>	<NA>	<NA>	<NA>
47519	<NA>	<NA>	<NA>	<NA>
47520	<NA>	<NA>	<NA>	<NA>
47521	<NA>	<NA>	<NA>	<NA>
47522	<NA>	<NA>	<NA>	<NA>

47523 rows × 4 columns

In [70]:

df["rangeA"].value_counts()

Out[70]:

rangeA
290    74
270    58
280    56
310    41
277    38
       ..
45      1
36      1
42      1
327     1
166     1
Name: count, Length: 264, dtype: int64

In [72]:

df["rangeA"].str.isnumeric().idxmax()

Out[72]:

In [74]:

df.iloc[:, [74, 75, 77]].pipe(pd.isna).idxmin()

Out[74]:

mfrCode      23147
c240Dscr     25661
c240bDscr    25661
dtype: int64
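
Chaining pd.isna with idxmin is a compact way to find the first row that holds a real value: isna yields True for missing entries, False sorts below True, and idxmin returns the label of the first minimum. A small sketch on made-up data (the Series contents and the names missing and first_present are illustrative):

```python
import pandas as pd

ser = pd.Series([pd.NA, pd.NA, "BMX", pd.NA, "FMX"], dtype=pd.StringDtype())

missing = ser.isna()              # True, True, False, True, False
first_present = missing.idxmin()  # label of the first False
print(first_present)
```

One caveat is that idxmin also returns the first label when every value is missing, so it is worth confirming that the column contains at least one non-missing entry.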

In [78]:

df = pd.read_csv(
    "../data/vehicles.csv.zip",
    dtype_backend="numpy_nullable",
    dtype={
        "rangeA": pd.StringDtype(),
        "mfrCode": pd.StringDtype(),
        "c240Dscr": pd.StringDtype(),
        "c240bDscr": pd.StringDtype()
    }
)

df.head()

Out[78]:

	barrels08	city08	...	mfrCode	c240Dscr	c240bDscr	createdOn	modifiedOn	startStop
0	14.167143	19	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
1	27.046364	9	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
2	11.018889	23	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
3	27.046364	10	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
4	15.658421	17	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>

5 rows × 84 columns

In [80]:

df.select_dtypes(include=["string"]).columns

Out[80]:

Index(['drive', 'eng_dscr', 'fuelType', 'fuelType1', 'make', 'model',
       'mpgData', 'trany', 'VClass', 'baseModel', 'guzzler', 'trans_dscr',
       'tCharger', 'sCharger', 'atvType', 'fuelType2', 'rangeA', 'evMotor',
       'mfrCode', 'c240Dscr', 'c240bDscr', 'createdOn', 'modifiedOn',
       'startStop'],
      dtype='object')

In [82]:

df.select_dtypes(include=["string"]).nunique().sort_values()

Out[82]:

tCharger         1
sCharger         1
mpgData          2
startStop        2
guzzler          3
fuelType2        4
c240Dscr         5
drive            7
fuelType1        7
c240bDscr        7
atvType          9
fuelType        15
VClass          34
trany           40
trans_dscr      52
mfrCode         56
make           144
rangeA         245
modifiedOn     298
evMotor        400
createdOn      455
eng_dscr       608
baseModel     1451
model         5064
dtype: int64

In [84]:

low_card = df.select_dtypes(include=["string"]).nunique().sort_values().iloc[:9].index
fig, axes = plt.subplots(nrows=3, ncols=3)

for index, column in enumerate(low_card):
    row, col = divmod(index, 3)
    ax = axes[row][col]
    df[column].value_counts().plot(kind="bar", ax=ax)

In [86]:

low_card = df.select_dtypes(include=["string"]).nunique().sort_values().iloc[:9].index
fig, axes = plt.subplots(nrows=3, ncols=3)

for index, column in enumerate(low_card):
    row = index % 3
    col = index // 3
    ax = axes[row][col]
    counts = df[column].value_counts()
    counts.set_axis(counts.index.str[:8]).plot(kind="bar", ax=ax)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize=6)

plt.tight_layout()
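
The two loops in this section place the charts in a different order: divmod(index, 3) fills the grid row by row, whereas index % 3 paired with index // 3 fills it column by column. A plain-Python sketch of the two mappings, assuming the same 3 by 3 grid (the names n_rows, n_cols, row_major, and col_major are illustrative):

```python
n_rows, n_cols = 3, 3  # assumed grid shape, matching the subplots above

# Row-major: fill each row left to right before moving down.
row_major = [divmod(index, n_cols) for index in range(n_rows * n_cols)]

# Column-major: fill each column top to bottom before moving right.
col_major = [(index % n_rows, index // n_rows) for index in range(n_rows * n_cols)]

for index, (rm, cm) in enumerate(zip(row_major, col_major)):
    print(index, rm, cm)
```

Both mappings visit every cell exactly once, so the choice only affects which chart ends up where.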

Exploring continuous data

How to do it

In [90]:

df = pd.read_csv(
    "../data/vehicles.csv.zip",
    dtype_backend="numpy_nullable",
    dtype={
        "rangeA": pd.StringDtype(),
        "mfrCode": pd.StringDtype(),
        "c240Dscr": pd.StringDtype(),
        "c240bDscr": pd.StringDtype()
    }
)
df.head()

Out[90]:

	barrels08	city08	...	mfrCode	c240Dscr	c240bDscr	createdOn	modifiedOn	startStop
0	14.167143	19	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
1	27.046364	9	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
2	11.018889	23	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
3	27.046364	10	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>
4	15.658421	17	...	<NA>	<NA>	<NA>	Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	<NA>

5 rows × 84 columns

In [92]:

df.select_dtypes(exclude=["string"]).columns

Out[92]:

Index(['barrels08', 'barrelsA08', 'charge120', 'charge240', 'city08',
       'city08U', 'cityA08', 'cityA08U', 'cityCD', 'cityE', 'cityUF', 'co2',
       'co2A', 'co2TailpipeAGpm', 'co2TailpipeGpm', 'comb08', 'comb08U',
       'combA08', 'combA08U', 'combE', 'combinedCD', 'combinedUF', 'cylinders',
       'displ', 'engId', 'feScore', 'fuelCost08', 'fuelCostA08', 'ghgScore',
       'ghgScoreA', 'highway08', 'highway08U', 'highwayA08', 'highwayA08U',
       'highwayCD', 'highwayE', 'highwayUF', 'hlv', 'hpv', 'id', 'lv2', 'lv4',
       'phevBlended', 'pv2', 'pv4', 'range', 'rangeCity', 'rangeCityA',
       'rangeHwy', 'rangeHwyA', 'UCity', 'UCityA', 'UHighway', 'UHighwayA',
       'year', 'youSaveSpend', 'charge240b', 'phevCity', 'phevHwy',
       'phevComb'],
      dtype='object')

In [94]:

df.select_dtypes(
    exclude=["string"]
).pipe(pd.isna).sum().sort_values(ascending=False).head()

Out[94]:

cylinders     801
displ         799
barrels08       0
barrelsA08      0
city08          0
dtype: int64

In [96]:

df.loc[df["cylinders"].isna(), ["make", "model"]].value_counts()

Out[96]:

make       model                           
Fiat       500e                                8
BYD        e6                                  7
Ford       Focus Electric                      7
Chevrolet  Bolt EV                             7
smart      fortwo electric drive coupe         7
                                              ..
Lucid      Air Dream R AWD w/21 inch wheels    1
           Air Dream R AWD w/19 inch wheels    1
Audi       Q4 40 e-tron                        1
Vinfast    VF 9 Plus                           1
           VF 9 Eco                            1
Name: count, Length: 450, dtype: int64

In [98]:

df["cylinders"] = df["cylinders"].fillna(0)

In [100]:

df.loc[df["displ"].isna(), ["make", "model"]].value_counts()

Out[100]:

make            model                             
Fiat            500e                                  8
Ford            Focus Electric                        7
Toyota          RAV4 EV                               7
smart           fortwo electric drive coupe           7
Nissan          Leaf                                  7
                                                     ..
Lexus           RZ 450e AWD (20 inch Wheels)          1
                RZ 450e AWD (20 inch wheels)          1
Vinfast         VF 9 Plus                             1
Azure Dynamics  Transit Connect Electric Van/Wagon    1
BMW             Active E                              1
Name: count, Length: 449, dtype: int64

In [102]:

df["displ"].nunique()

Out[102]:

In [104]:

df["city08"].plot(kind="hist")

Out[104]:

<Axes: ylabel='Frequency'>

In [106]:

df["city08"].plot(kind="hist", bins=30)

Out[106]:

<Axes: ylabel='Frequency'>

In [108]:

fig, axes = plt.subplots(nrows=2, ncols=1)
axes[0].set_xlim(0, 40)
axes[1].set_xlim(0, 40)

df["city08"].plot(kind="kde", ax=axes[0])
df["highway08"].plot(kind="kde", ax=axes[1])

axes[0].set_ylabel("city")
axes[1].set_ylabel("highway")

Out[108]:

Text(0, 0.5, 'highway')

Using seaborn for advanced plots

In [71]:

import seaborn as sns
sns.set_theme()
sns.set_style("white")

How to do it

In [72]:

df = pd.DataFrame([
    ["Q1-2024", "project_a", 1],
    ["Q1-2024", "project_b", 1],
    ["Q2-2024", "project_a", 2],
    ["Q2-2024", "project_b", 2],
    ["Q3-2024", "project_a", 4],
    ["Q3-2024", "project_b", 3],
    ["Q4-2024", "project_a", 8],
    ["Q4-2024", "project_b", 4],
    ["Q1-2025", "project_a", 16],
    ["Q1-2025", "project_b", 5],
], columns=["quarter", "project", "github_stars"])
df = df.convert_dtypes(dtype_backend="numpy_nullable")

df

Out[72]:

	quarter	project	github_stars
0	Q1-2024	project_a	1
1	Q1-2024	project_b	1
2	Q2-2024	project_a	2
3	Q2-2024	project_b	2
4	Q3-2024	project_a	4
5	Q3-2024	project_b	3
6	Q4-2024	project_a	8
7	Q4-2024	project_b	4
8	Q1-2025	project_a	16
9	Q1-2025	project_b	5

In [73]:

sns.barplot(df, x="quarter", y="github_stars", hue="project")

Out[73]:

<Axes: xlabel='quarter', ylabel='github_stars'>

In [74]:

sns.lineplot(df, x="quarter", y="github_stars", hue="project")

Out[74]:

<Axes: xlabel='quarter', ylabel='github_stars'>

In [75]:

df

Out[75]:

	quarter	project	github_stars
0	Q1-2024	project_a	1
1	Q1-2024	project_b	1
2	Q2-2024	project_a	2
3	Q2-2024	project_b	2
4	Q3-2024	project_a	4
5	Q3-2024	project_b	3
6	Q4-2024	project_a	8
7	Q4-2024	project_b	4
8	Q1-2025	project_a	16
9	Q1-2025	project_b	5

In [76]:

df = pd.DataFrame({
    "project_a": [1, 2, 4, 8, 16],
    "project_b": [1, 2, 3, 4, 5],
}, index=["Q1-2024", "Q2-2024", "Q3-2024", "Q4-2024", "Q1-2025"])
df = df.convert_dtypes(dtype_backend="numpy_nullable")

df

Out[76]:

	project_a	project_b
Q1-2024	1	1
Q2-2024	2	2
Q3-2024	4	3
Q4-2024	8	4
Q1-2025	16	5
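
The same GitHub-star data appears above in two shapes: a long form with one row per quarter and project, which suits seaborn's x, y, and hue arguments, and a wide form with one column per project, which suits DataFrame.plot. The notebook builds each frame by hand. The sketch below shows one way to convert between them with pivot and melt, using a shortened copy of the data. The names long_df, wide_df, and round_trip are illustrative.

```python
import pandas as pd

long_df = pd.DataFrame(
    [
        ["Q1-2024", "project_a", 1],
        ["Q1-2024", "project_b", 1],
        ["Q2-2024", "project_a", 2],
        ["Q2-2024", "project_b", 2],
    ],
    columns=["quarter", "project", "github_stars"],
)

# Long -> wide: one row per quarter, one column per project.
wide_df = long_df.pivot(index="quarter", columns="project", values="github_stars")

# Wide -> long again: melt the project columns back into rows.
round_trip = (
    wide_df.reset_index()
    .melt(id_vars="quarter", var_name="project", value_name="github_stars")
    .sort_values(["quarter", "project"], ignore_index=True)
)
print(wide_df)
```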

In [110]:

df = pd.read_csv(
    "../data/movie.csv",
    usecols=["movie_title", "title_year", "imdb_score", "content_rating"],
    dtype_backend="numpy_nullable",
)
df.head()

Out[110]:

	movie_title	content_rating	title_year	imdb_score
0	Avatar	PG-13	2009.0	7.9
1	Pirates of the Caribbean: At World's End	PG-13	2007.0	7.1
2	Spectre	PG-13	2015.0	6.8
3	The Dark Knight Rises	PG-13	2012.0	8.5
4	Star Wars: Episode VII - The Force Awakens	<NA>	<NA>	7.1

In [114]:

df = pd.read_csv(
    "../data/movie.csv",
    usecols=["movie_title", "title_year", "imdb_score", "content_rating"],
    dtype_backend="numpy_nullable",
    dtype={"title_year": pd.Int16Dtype()},
)
df.head()

Out[114]:

	movie_title	content_rating	title_year	imdb_score
0	Avatar	PG-13	2009	7.9
1	Pirates of the Caribbean: At World's End	PG-13	2007	7.1
2	Spectre	PG-13	2015	6.8
3	The Dark Knight Rises	PG-13	2012	8.5
4	Star Wars: Episode VII - The Force Awakens	<NA>	<NA>	7.1

In [79]:

df["title_year"].min()

Out[79]:

In [80]:

df["title_year"].max()

Out[80]:

In [81]:

df = df.assign(
    title_decade=lambda x: pd.cut(x["title_year"],
                                  bins=range(1910, 2021, 10)))

df.head()

Out[81]:

	movie_title	content_rating	title_year	imdb_score	title_decade
0	Avatar	PG-13	2009	7.9	(2000.0, 2010.0]
1	Pirates of the Caribbean: At World's End	PG-13	2007	7.1	(2000.0, 2010.0]
2	Spectre	PG-13	2015	6.8	(2010.0, 2020.0]
3	The Dark Knight Rises	PG-13	2012	8.5	(2010.0, 2020.0]
4	Star Wars: Episode VII - The Force Awakens	<NA>	<NA>	7.1	NaN
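
pd.cut builds right-closed intervals by default, so a film released in 2010 lands in (2000, 2010] rather than alongside the rest of the 2010s. If calendar decades are what you want, passing right=False is one way to get left-closed bins. The sketch below compares the two on a few made-up years. The variable names are illustrative.

```python
import pandas as pd

years = pd.Series([2009, 2010, 2011])
edges = [2000, 2010, 2020]

right_closed = pd.cut(years, bins=edges)              # default: (a, b]
left_closed = pd.cut(years, bins=edges, right=False)  # [a, b)

print(
    pd.DataFrame(
        {"year": years, "right_closed": right_closed, "left_closed": left_closed}
    )
)
```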

In [82]:

sns.boxplot(
    data=df,
    x="imdb_score",
    y="title_decade",
)

Out[82]:

<Axes: xlabel='imdb_score', ylabel='title_decade'>

In [83]:

sns.violinplot(
    data=df,
    x="imdb_score",
    y="title_decade",
)

Out[83]:

<Axes: xlabel='imdb_score', ylabel='title_decade'>

In [84]:

sns.swarmplot(
    data=df,
    x="imdb_score",
    y="title_decade",
    size=.25,
)

Out[84]:

<Axes: xlabel='imdb_score', ylabel='title_decade'>

In [85]:

ratings_of_interest = {"G", "PG", "PG-13", "R"}
mask = (
    (df["title_year"] >= 2013)
    & (df["title_year"] <= 2015)
    & (df["content_rating"].isin(ratings_of_interest))
)
data = df[mask].assign(
    title_year=lambda x: x["title_year"].astype(pd.CategoricalDtype())
)
data.head()

Out[85]:

	movie_title	content_rating	title_year	imdb_score	title_decade
2	Spectre	PG-13	2015	6.8	(2010, 2020]
8	Avengers: Age of Ultron	PG-13	2015	7.5	(2010, 2020]
14	The Lone Ranger	PG-13	2013	6.5	(2010, 2020]
15	Man of Steel	PG-13	2013	7.2	(2010, 2020]
20	The Hobbit: The Battle of the Five Armies	PG-13	2014	7.5	(2010, 2020]

In [86]:

sns.swarmplot(
    data=data,
    x="imdb_score",
    y="title_year",
    hue="content_rating",
)

Out[86]:

<Axes: xlabel='imdb_score', ylabel='title_year'>

In [87]:

sns.catplot(
    kind="swarm",
    data=data,
    x="imdb_score",
    y="title_year",
    col="content_rating",
    col_wrap=2,
)

Out[87]:

<seaborn.axisgrid.FacetGrid at 0x7d6980a39dc0>
