Histograms, Binning, and Density Plots
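We can draw a histogram with the hist() function in pyplot.
hist() is the histogram-plotting function in pyplot, a submodule of the Matplotlib library.
It is useful for visualizing how data is distributed, for example to inspect central tendency, skewness, and outliers.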
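The hist() function
matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None, cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical', rwidth=None, log=False, color=None, label=None, stacked=False, *, data=None, **kwargs)
Parameters:
x: the data to plot, given as a one-dimensional array or list (a sequence of such arrays is also accepted).
bins: optional. An integer number of equal-width bins, a sequence of bin edges, or the name of a binning strategy. Defaults to 10 (taken from rcParams hist.bins).
range: optional. The lower and upper bounds of the bins, as a two-element tuple or list. Defaults to None, which uses the minimum and maximum of the data.
density: optional. Whether to normalize the histogram. Defaults to False, so each bar height is the number of samples in that bin rather than a probability density.
weights: optional. A weight for each data point. Defaults to None.
cumulative: optional. Whether to draw a cumulative histogram. Defaults to False.
bottom: optional. The baseline height from which the bars are drawn. Defaults to None (that is, 0).
histtype: optional. The type of histogram: 'bar', 'barstacked', 'step', or 'stepfilled'. Defaults to 'bar'.
align: optional. How bars are aligned to the bins: 'left', 'mid', or 'right'. Defaults to 'mid'.
orientation: optional. The direction of the histogram: 'vertical' or 'horizontal'. Defaults to 'vertical'.
rwidth: optional. The relative width of the bars as a fraction of the bin width. Defaults to None (computed automatically).
log: optional. Whether to use a log scale on the count axis (the y-axis for a vertical histogram). Defaults to False.
color: optional. The color of the histogram, or a sequence of colors with one per dataset.
label: optional. The legend label of the histogram.
stacked: optional. Whether multiple datasets are stacked on top of each other. Defaults to False.
**kwargs: optional. Additional Patch properties passed through to the drawn artists.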
A basic example:
# The usual setup
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8-whitegrid")
# Settings for rendering Chinese characters in labels
plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
data = np.random.randn(100)
plt.hist(data, edgecolor='y');
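Besides drawing the figure, hist() returns the bin values, the bin edges, and the drawn patches. The following is a minimal sketch of how those return values might be inspected; the seed, the sample size, and the variable names are assumptions made for illustration and do not come from the text above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)      # fixed seed, chosen only for reproducibility
sample = rng.normal(size=500)       # hypothetical sample data

# density=True rescales the bar heights so that the total bar area is 1
n, bin_edges, patches = plt.hist(sample, bins=20, density=True)

# Area of the histogram = sum of (height * width) over all bins
area = np.sum(n * np.diff(bin_edges))

print(len(n), len(bin_edges))       # 20 bins are delimited by 21 edges
print(np.isclose(area, 1.0))
```

With density=False (the default), n would hold the raw counts per bin instead.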
The hist() function has many options for tuning both the calculation and the display. Here is a more customized histogram:
plt.hist(data, bins=30, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');
help(plt.hist)
Help on function hist in module matplotlib.pyplot:
hist(x: 'ArrayLike | Sequence[ArrayLike]', bins: 'int | Sequence[float] | str | None' = None, range: 'tuple[float, float] | None' = None, density: 'bool' = False, weights: 'ArrayLike | None' = None, cumulative: 'bool | float' = False, bottom: 'ArrayLike | float | None' = None, histtype: "Literal['bar', 'barstacked', 'step', 'stepfilled']" = 'bar', align: "Literal['left', 'mid', 'right']" = 'mid', orientation: "Literal['vertical', 'horizontal']" = 'vertical', rwidth: 'float | None' = None, log: 'bool' = False, color: 'ColorType | Sequence[ColorType] | None' = None, label: 'str | Sequence[str] | None' = None, stacked: 'bool' = False, *, data=None, **kwargs) -> 'tuple[np.ndarray | list[np.ndarray], np.ndarray, BarContainer | Polygon | list[BarContainer | Polygon]]'
Compute and plot a histogram.
This method uses `numpy.histogram` to bin the data in *x* and count the
number of values in each bin, then draws the distribution either as a
`.BarContainer` or `.Polygon`. The *bins*, *range*, *density*, and
*weights* parameters are forwarded to `numpy.histogram`.
If the data has already been binned and counted, use `~.bar` or
`~.stairs` to plot the distribution::
counts, bins = np.histogram(x)
plt.stairs(counts, bins)
Alternatively, plot pre-computed bins and counts using ``hist()`` by
treating each bin as a single point with a weight equal to its count::
plt.hist(bins[:-1], bins, weights=counts)
The data input *x* can be a singular array, a list of datasets of
potentially different lengths ([*x0*, *x1*, ...]), or a 2D ndarray in
which each column is a dataset. Note that the ndarray form is
transposed relative to the list form. If the input is an array, then
the return value is a tuple (*n*, *bins*, *patches*); if the input is a
sequence of arrays, then the return value is a tuple
([*n0*, *n1*, ...], *bins*, [*patches0*, *patches1*, ...]).
Masked arrays are not supported.
Parameters
----------
x : (n,) array or sequence of (n,) arrays
Input values, this takes either a single array or a sequence of
arrays which are not required to be of the same length.
bins : int or sequence or str, default: :rc:`hist.bins`
If *bins* is an integer, it defines the number of equal-width bins
in the range.
If *bins* is a sequence, it defines the bin edges, including the
left edge of the first bin and the right edge of the last bin;
in this case, bins may be unequally spaced. All but the last
(righthand-most) bin is half-open. In other words, if *bins* is::
[1, 2, 3, 4]
then the first bin is ``[1, 2)`` (including 1, but excluding 2) and
the second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which
*includes* 4.
If *bins* is a string, it is one of the binning strategies
supported by `numpy.histogram_bin_edges`: 'auto', 'fd', 'doane',
'scott', 'stone', 'rice', 'sturges', or 'sqrt'.
range : tuple or None, default: None
The lower and upper range of the bins. Lower and upper outliers
are ignored. If not provided, *range* is ``(x.min(), x.max())``.
Range has no effect if *bins* is a sequence.
If *bins* is a sequence or *range* is specified, autoscaling
is based on the specified bin range instead of the
range of x.
density : bool, default: False
If ``True``, draw and return a probability density: each bin
will display the bin's raw count divided by the total number of
counts *and the bin width*
(``density = counts / (sum(counts) * np.diff(bins))``),
so that the area under the histogram integrates to 1
(``np.sum(density * np.diff(bins)) == 1``).
If *stacked* is also ``True``, the sum of the histograms is
normalized to 1.
weights : (n,) array-like or None, default: None
An array of weights, of the same shape as *x*. Each value in
*x* only contributes its associated weight towards the bin count
(instead of 1). If *density* is ``True``, the weights are
normalized, so that the integral of the density over the range
remains 1.
cumulative : bool or -1, default: False
If ``True``, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints.
If *density* is also ``True`` then the histogram is normalized such
that the last bin equals 1.
If *cumulative* is a number less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if *density* is also
``True``, then the histogram is normalized such that the first bin
equals 1.
bottom : array-like, scalar, or None, default: None
Location of the bottom of each bin, i.e. bins are drawn from
``bottom`` to ``bottom + hist(x, bins)`` If a scalar, the bottom
of each bin is shifted by the same amount. If an array, each bin
is shifted independently and the length of bottom must match the
number of bins. If None, defaults to 0.
histtype : {'bar', 'barstacked', 'step', 'stepfilled'}, default: 'bar'
The type of histogram to draw.
- 'bar' is a traditional bar-type histogram. If multiple data
are given the bars are arranged side by side.
- 'barstacked' is a bar-type histogram where multiple
data are stacked on top of each other.
- 'step' generates a lineplot that is by default unfilled.
- 'stepfilled' generates a lineplot that is by default filled.
align : {'left', 'mid', 'right'}, default: 'mid'
The horizontal alignment of the histogram bars.
- 'left': bars are centered on the left bin edges.
- 'mid': bars are centered between the bin edges.
- 'right': bars are centered on the right bin edges.
orientation : {'vertical', 'horizontal'}, default: 'vertical'
If 'horizontal', `~.Axes.barh` will be used for bar-type histograms
and the *bottom* kwarg will be the left edges.
rwidth : float or None, default: None
The relative width of the bars as a fraction of the bin width. If
``None``, automatically compute the width.
Ignored if *histtype* is 'step' or 'stepfilled'.
log : bool, default: False
If ``True``, the histogram axis will be set to a log scale.
color : color or array-like of colors or None, default: None
Color or sequence of colors, one per dataset. Default (``None``)
uses the standard line color sequence.
label : str or None, default: None
String, or sequence of strings to match multiple datasets. Bar
charts yield multiple patches per dataset, but only the first gets
the label, so that `~.Axes.legend` will work as expected.
stacked : bool, default: False
If ``True``, multiple data are stacked on top of each other If
``False`` multiple data are arranged side by side if histtype is
'bar' or on top of each other if histtype is 'step'
Returns
-------
n : array or list of arrays
The values of the histogram bins. See *density* and *weights* for a
description of the possible semantics. If input *x* is an array,
then this is an array of length *nbins*. If input is a sequence of
arrays ``[data1, data2, ...]``, then this is a list of arrays with
the values of the histograms for each of the arrays in the same
order. The dtype of the array *n* (or of its element arrays) will
always be float even if no weighting or normalization is used.
bins : array
The edges of the bins. Length nbins + 1 (nbins left edges and right
edge of last bin). Always a single array even when multiple data
sets are passed in.
patches : `.BarContainer` or list of a single `.Polygon` or list of such objects
Container of individual artists used to create the histogram
or list of such containers if there are multiple input datasets.
Other Parameters
----------------
data : indexable object, optional
If given, the following parameters also accept a string ``s``, which is
interpreted as ``data[s]`` (unless this raises an exception):
*x*, *weights*
**kwargs
`~matplotlib.patches.Patch` properties
See Also
--------
hist2d : 2D histogram with rectangular bins
hexbin : 2D histogram with hexagonal bins
stairs : Plot a pre-computed histogram
bar : Plot a pre-computed histogram
Notes
-----
For large numbers of bins (>1000), plotting can be significantly
accelerated by using `~.Axes.stairs` to plot a pre-computed histogram
(``plt.stairs(*np.histogram(data))``), or by setting *histtype* to
'step' or 'stepfilled' rather than 'bar' or 'barstacked'.
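The bin-edge convention described in the help text (every bin is half-open except the last one, and values outside range are ignored) can be checked directly with np.histogram, to which hist() forwards its binning. This is a small sketch with made-up input values:

```python
import numpy as np

values = [1, 2, 3, 4]
edges = [1, 2, 3, 4]

# Bins are [1, 2), [2, 3), and [3, 4]; the last bin includes its right edge
counts, _ = np.histogram(values, bins=edges)
print(counts)  # [1 1 2]

# With an explicit range, 0 and 5 fall outside (1, 4) and are not counted
in_range, range_edges = np.histogram([0, 1, 2, 3, 4, 5], bins=2, range=(1, 4))
print(in_range)  # [2 2]
```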
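The plt.hist docstring describes the remaining customization options in detail. In practice, when comparing histograms of samples drawn from different distributions, setting histtype='stepfilled' together with some transparency via alpha works very well, as shown here: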
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
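When the samples being compared differ in size, raw counts can make the larger sample dominate the figure. One possible variation, sketched here under the assumption of two hypothetical samples named small and large, is to share explicit bin edges and set density=True so that each histogram has unit area:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)          # seed chosen only for reproducibility
small = rng.normal(0, 1, 200)           # hypothetical small sample
large = rng.normal(2, 1, 5000)          # hypothetical large sample

# Shared bin edges make the two histograms directly comparable
edges = np.linspace(-4, 6, 41)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=edges, density=True)

n_small, _, _ = plt.hist(small, label='small sample', **kwargs)
n_large, _, _ = plt.hist(large, label='large sample', **kwargs)
plt.legend()
```

Both histograms are normalized independently, so their heights are comparable even though one sample is 25 times larger than the other.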
If you only want to compute the histogram (that is, count the number of samples in each bin) without displaying it, the np.histogram() function does exactly that:
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
[ 5 21 41 26 7]
bin_edges
array([-2.46258921, -1.47129294, -0.47999667, 0.51129959, 1.50259586,
2.49389213])
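The arrays returned by np.histogram are often post-processed before plotting or reporting. The sketch below derives bin centers and widths from the edges; the seed and the names values, centers, and widths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)     # seed chosen only for reproducibility
values = rng.standard_normal(100)   # hypothetical data, 100 samples

counts, bin_edges = np.histogram(values, bins=5)

# Bin centers and widths, e.g. for plt.bar(centers, counts, width=widths)
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
widths = np.diff(bin_edges)

# With the default range (min to max), every sample lands in exactly one bin
print(counts.sum())  # 100
```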
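Two-Dimensional Histograms and Binnings
Just as we create a one-dimensional histogram by dividing a one-dimensional range into bins, we can create a two-dimensional histogram by dividing two-dimensional data among two-dimensional bins. We use data drawn from a multivariate normal distribution to walk through how a two-dimensional histogram is drawn.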
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.hist2d: Two-Dimensional Histogram
The simplest way to plot a two-dimensional histogram is Matplotlib's built-in plt.hist2d function:
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Like plt.hist, plt.hist2d has many options for fine-tuning the plot and the binning, all described in its docstring. And just as plt.hist has a compute-only counterpart in np.histogram, plt.hist2d has one in np.histogram2d, used as follows:
counts, xedges, yedges = np.histogram2d(x, y, bins=30)
#print(counts)
#print(xedges)
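The arrays returned by np.histogram2d can also be plotted by hand. A point worth knowing is that counts is indexed as counts[x_bin, y_bin], so it usually needs to be transposed before being passed to an image-style plotting call. The sketch below regenerates similar data under the assumed names xs and ys so that it runs on its own:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
xs, ys = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T

counts, xedges, yedges = np.histogram2d(xs, ys, bins=30)

print(counts.shape)   # (30, 30): one row per x bin, one column per y bin
print(counts.sum())   # 10000.0: every point falls inside the default range

# pcolormesh expects rows to run along y, hence the transpose
plt.pcolormesh(xedges, yedges, counts.T, cmap='Blues')
plt.colorbar(label='counts in bin')
```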
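For histograms of data with more than two dimensions, use the np.histogramdd function. It is not covered in detail here and is not required material; if you need it for data processing, see the np.histogramdd documentation.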
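For orientation only, here is a minimal sketch of what a call to np.histogramdd looks like. The three-dimensional random data and the bin counts per axis are assumptions chosen to make the output shapes easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen only for reproducibility
points = rng.normal(size=(1000, 3))     # hypothetical data: 1000 samples, 3 dimensions

# One bin count per dimension; the result is a 3-D array of counts
counts, edges = np.histogramdd(points, bins=(4, 5, 6))

print(counts.shape)              # (4, 5, 6)
print([len(e) for e in edges])   # [5, 6, 7]: each axis has nbins + 1 edges
```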
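plt.hexbin: Hexagonal Binnings
The two-dimensional histogram above tiles the plane with axis-aligned rectangles. Another common choice is a tessellation of regular hexagons. Matplotlib provides the plt.hexbin function for this purpose, which bins a two-dimensional dataset into a honeycomb grid: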
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
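plt.hexbin also has a number of interesting options, including the ability to assign a weight to each data point and to change the value reported for each hexagon to any NumPy aggregate (the mean of the weights, their standard deviation, and so on).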
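The per-point weights and the aggregate are passed through the C and reduce_C_function parameters of plt.hexbin. The sketch below is one way this might look; the weight array w is a made-up quantity used purely for illustration, and the data is regenerated under the assumed names xs and ys.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
xs, ys = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T
w = xs + ys                     # hypothetical per-point value to aggregate

# Each hexagon shows the mean of w over the points that fall inside it
hb = plt.hexbin(xs, ys, C=w, reduce_C_function=np.mean,
                gridsize=30, cmap='coolwarm')
plt.colorbar(label='mean of w in bin')

# One aggregated value per drawn (non-empty) hexagon
values = hb.get_array()
print(len(values) > 0)
```

Swapping np.mean for np.std or np.sum changes the statistic shown in each hexagon without changing the binning.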
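Kernel Density Estimation
Another common way to analyze multidimensional data is kernel density estimation (KDE). Let's look at how KDE smooths out the discrete points in space to fit a smooth function.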
from scipy.stats import gaussian_kde
# Fit on an array of shape [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)
# Evaluate the estimate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
           origin='lower', aspect='auto',
           extent=[-3.5, 3.5, -6, 6],
           cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")
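KDE uses a smoothing bandwidth to trade off fidelity to the data against smoothness of the fitted function. Choosing an appropriate bandwidth is not easy, and gaussian_kde uses a rule of thumb (Scott's rule by default) to find a nearly optimal bandwidth for the input data. KDE is analyzed in more detail in the machine learning part.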
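The effect of the bandwidth can be explored through the bw_method argument of gaussian_kde. The following is a rough sketch on one-dimensional data; the sample, the grid, and the two manual bandwidth factors (0.1 and 1.0) are arbitrary assumptions chosen to make the contrast visible.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)      # seed chosen only for reproducibility
samples = rng.normal(size=300)      # hypothetical 1-D sample
grid = np.linspace(-4, 4, 200)

kde_default = gaussian_kde(samples)                 # rule-of-thumb bandwidth
kde_narrow = gaussian_kde(samples, bw_method=0.1)   # small factor: follows the data closely
kde_wide = gaussian_kde(samples, bw_method=1.0)     # large factor: very smooth

# The bandwidth factor chosen by each estimator
print(kde_default.factor, kde_narrow.factor, kde_wide.factor)

# Evaluate each estimate on the grid, e.g. for plt.plot(grid, density_default)
density_default = kde_default(grid)
density_narrow = kde_narrow(grid)
density_wide = kde_wide(grid)
```

Plotting the three curves on the same axes shows the narrow estimate picking up sampling noise while the wide one flattens the peak.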