Pandas Official Documentation (Chinese Edition): Essential Basic Functionality, Part 2

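Descriptive statistics

Series and DataFrame support a large number of methods and operations for computing descriptive statistics.

Most of these are aggregations such as sum(), mean(), and quantile(), which produce a result smaller than the original data. Others, such as cumsum() and cumprod(), produce an object of the same size as the input.

Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or by integer: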

∙ Series: no axis argument needed

∙ DataFrame:

  ▪ "index" (axis=0, the default)

  ▪ "columns" (axis=1)

For example:

In [77]: df
Out[77]:
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [78]: df.mean(0)
Out[78]:
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [79]: df.mean(1)
Out[79]:
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64
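The axis can also be passed by name rather than by integer. The following is a minimal sketch on a small made-up frame (the name `toy` and its values are assumptions for illustration, not the `df` used above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the means are easy to check by hand.
toy = pd.DataFrame(
    {"one": [1.0, 2.0, np.nan], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# axis="index" (same as axis=0) reduces down the rows: one value per column.
col_means = toy.mean(axis="index")

# axis="columns" (same as axis=1) reduces across the columns: one value per row.
row_means = toy.mean(axis="columns")

print(col_means)
print(row_means)
```

Row "c" has a missing value in column "one", so its row mean is computed from the single remaining value.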

All of these methods have a skipna option that controls whether missing data is excluded. It defaults to True.

In [80]: df.sum(0, skipna=False)
Out[80]:
one           NaN
two      5.442353
three         NaN
dtype: float64

In [81]: df.sum(axis=1, skipna=True)
Out[81]:
a    3.167498
b    2.204786
c    3.401050
d   -0.333828
dtype: float64
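To see the effect of skipna in isolation, here is a small sketch on a three-element Series (the name `values` and its data are made up for illustration):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# Default behaviour (skipna=True): the NaN is ignored.
total_skipping = values.sum()
mean_skipping = values.mean()  # averaged over the two non-missing values

# skipna=False: a single NaN makes the whole result NaN.
total_strict = values.sum(skipna=False)

print(total_skipping, mean_skipping, total_strict)
```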

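Combined with broadcasting and arithmetic operations, these methods let you express various statistical procedures very concisely. One example is standardization, which rescales data to zero mean and a standard deviation of 1:

In [82]: ts_stand = (df - df.mean()) / df.std()

In [83]: ts_stand.std()
Out[83]:
one      1.0
two      1.0
three    1.0
dtype: float64

In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [85]: xs_stand.std(1)
Out[85]:
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64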
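The same standardization can be checked numerically. The sketch below uses a made-up frame (`toy` and its columns are assumptions) and confirms that both the column-wise and the row-wise variants yield unit standard deviation. Note that pandas `std()` uses the sample standard deviation (ddof=1) by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data without missing values, to keep the check simple.
toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [10.0, 20.0, 40.0, 80.0],
        "z": [0.5, 0.1, 0.9, 0.3],
    }
)

# Column-wise: subtract each column's mean, divide by each column's std.
col_stand = (toy - toy.mean()) / toy.std()

# Row-wise: axis=0 in sub/div aligns the per-row statistics on the row index.
row_stand = toy.sub(toy.mean(axis=1), axis=0).div(toy.std(axis=1), axis=0)

print(col_stand.std())
print(row_stand.std(axis=1))
```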

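Note: methods such as cumsum() and cumprod() preserve the location of NaN values. This differs somewhat from expanding() and rolling(); see the documentation of those methods for details.

In [86]: df.cumsum()
Out[86]:
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873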
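The difference between cumsum() and expanding() can be sketched on a short Series with one gap (the data is made up; expanding() is used with its default min_periods=1):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# cumsum keeps NaN where the input has NaN, then continues accumulating.
running = s.cumsum()

# expanding().sum() reports the sum of the non-missing values seen so far,
# so the gap is filled with the previous running total.
expanding_total = s.expanding().sum()

print(running)
print(expanding_total)
```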

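The table below summarizes the commonly used functions.

Each function also takes an optional level parameter, which applies only if the object has a hierarchical index.

Function    Description
count       Number of non-NA observations
sum         Sum of values
mean        Mean of values
mad         Mean absolute deviation
median      Arithmetic median of values
min         Minimum
max         Maximum
mode        Mode
abs         Absolute value
prod        Product of values
std         Bessel-corrected sample standard deviation
var         Unbiased variance
sem         Standard error of the mean
skew        Sample skewness (3rd moment)
kurt        Sample kurtosis (4th moment)
quantile    Sample quantile (value at %)
cumsum      Cumulative sum
cumprod     Cumulative product
cummax      Cumulative maximum
cummin      Cumulative minimum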

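Note: NumPy functions such as mean, std, and sum exclude missing values by default when given a Series, but not when given the underlying array.

In [87]: np.mean(df['one'])
Out[87]: 0.8110935116651192

In [88]: np.mean(df['one'].to_numpy())
Out[88]: nan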
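If you need to work on the raw array but still want missing values ignored, NumPy's NaN-aware functions are one option. A minimal sketch, using a made-up Series named `col`:

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, 2.0, np.nan])

# Passing the Series: NumPy defers to Series.mean, which skips NaN.
mean_of_series = np.mean(col)

# Passing the raw ndarray: plain np.mean propagates NaN.
mean_of_array = np.mean(col.to_numpy())

# np.nanmean ignores NaN on a plain ndarray.
nanmean_of_array = np.nanmean(col.to_numpy())

print(mean_of_series, mean_of_array, nanmean_of_array)
```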

Series.nunique() returns the number of unique non-NA values in a Series.

In [89]: series = pd.Series(np.random.randn(500))

In [90]: series[20:500] = np.nan

In [91]: series[10:20] = 5

In [92]: series.nunique()
Out[92]: 11
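Whether NaN counts as a value is controlled by the dropna parameter of nunique(). A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, 5.0, 1.5, np.nan, np.nan])

# Default: NaN is not counted, so the distinct values are 5.0 and 1.5.
n_without_nan = s.nunique()

# dropna=False counts NaN as one additional distinct value.
n_with_nan = s.nunique(dropna=False)

# unique() returns the distinct values themselves, NaN included.
distinct = s.unique()

print(n_without_nan, n_with_nan, distinct)
```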

Summarizing data: describe

The describe() function computes a variety of summary statistics for a Series or for the columns of a DataFrame. Missing values are excluded.

In [93]: series = pd.Series(np.random.randn(1000))

In [94]: series[::2] = np.nan

In [95]: series.describe()
Out[95]:
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
25%       -0.699070
50%       -0.069718
75%        0.714483
max        3.160915
dtype: float64

In [96]: frame = pd.DataFrame(np.random.randn(1000, 5),
   ....:                      columns=['a', 'b', 'c', 'd', 'e'])
   ....:

In [97]: frame.iloc[::2] = np.nan

In [98]: frame.describe()
Out[98]:
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean     0.033387    0.030045   -0.043719   -0.051686    0.005979
std      1.017152    0.978743    1.025270    1.015988    1.006695
min     -3.000951   -2.637901   -3.303099   -3.159200   -3.188821
25%     -0.647623   -0.576449   -0.712369   -0.691338   -0.691115
50%      0.047578   -0.021499   -0.023888   -0.032652   -0.025363
75%      0.729907    0.775880    0.618896    0.670047    0.649748
max      2.740139    2.752332    3.004229    2.728702    3.240991

You can also choose which percentiles to include in the output:

In [99]: series.describe(percentiles=[.05, .25, .75, .95])
Out[99]:
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
5%        -1.645423
25%       -0.699070
50%       -0.069718
75%        0.714483
95%        1.711409
max        3.160915
dtype: float64

By default, the median is always included.
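The requested percentiles appear in the result under labels such as "10%", so individual values can be looked up by label. A minimal sketch on made-up data (the integers 1 to 100, chosen so the percentiles are easy to verify by hand):

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(1, 101, dtype=float))

# Ask for the 10th and 90th percentiles instead of the default quartiles.
summary = data.describe(percentiles=[0.1, 0.9])

# Percentile rows are labelled with a percent sign.
p10 = summary["10%"]
p90 = summary["90%"]

print(summary)
```

The values agree with `data.quantile(0.1)` and `data.quantile(0.9)`, which use linear interpolation by default.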

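For a non-numeric Series, describe() returns the number of values, the number of unique values, the most frequent value, and how often that value occurs.

In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [101]: s.describe()
Out[101]:
count     9
unique    4
top       a
freq      5
dtype: object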

Note: on a mixed-type DataFrame, describe() restricts the summary to the numeric columns. If there are no numeric columns, it summarizes only the categorical columns.

In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [103]: frame.describe()
Out[103]:
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

The include and exclude parameters take a list of data types and control which columns are summarized. The special value 'all' is also accepted:

In [104]: frame.describe(include=['object'])
Out[104]:
          a
count     4
unique    2
top     Yes
freq      2

In [105]: frame.describe(include=['number'])
Out[105]:
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [106]: frame.describe(include='all')
Out[106]:
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000

This feature relies on select_dtypes; see the select_dtypes documentation for the inputs it accepts.
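Since the column selection is delegated to select_dtypes, calling it directly is a quick way to preview which columns a given include or exclude value would pick. A sketch that rebuilds the small frame from above:

```python
import pandas as pd

frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

# Columns that a numeric selection would keep.
numeric_part = frame.select_dtypes(include="number")

# Everything that is not numeric.
other_part = frame.select_dtypes(exclude="number")

# describe() with the same selector summarizes the same columns.
described = frame.describe(include="number")

print(list(numeric_part.columns), list(other_part.columns))
print(described)
```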

Index of min/max values

The idxmax() and idxmin() functions on Series and DataFrame compute the index labels of the maximum and minimum values.

In [107]: s1 = pd.Series(np.random.randn(5))

In [108]: s1
Out[108]:
0    1.118076
1   -0.352051
2   -1.242883
3   -1.277155
4   -0.641184
dtype: float64

In [109]: s1.idxmin(), s1.idxmax()
Out[109]: (3, 0)

In [110]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [111]: df1
Out[111]:
          A         B         C
0 -0.327863 -0.946180 -0.137570
1 -0.186235 -0.257213 -0.486567
2 -0.507027 -0.871259 -0.111110
3  2.000339 -2.430505  0.089759
4 -0.321434 -0.033695  0.096271

In [112]: df1.idxmin(axis=0)
Out[112]:
A    2
B    3
C    1
dtype: int64

In [113]: df1.idxmax(axis=1)
Out[113]:
0    C
1    A
2    C
3    A
4    C
dtype: object

When several rows (or columns) share the minimum or maximum value, idxmin() and idxmax() return the first matching index label:

In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [115]: df3
Out[115]:
     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [116]: df3['A'].idxmin()
Out[116]: 'd'

Note

idxmin and idxmax correspond to argmin and argmax in NumPy.
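One practical difference is that idxmin() returns an index label, while NumPy's argmin returns an integer position. A minimal sketch on made-up data with a tie for the minimum:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 1.0, 1.0, 3.0], index=list("edcb"))

# pandas: label of the first occurrence of the minimum.
label = s.idxmin()

# NumPy: integer position of the first occurrence of the minimum.
position = int(np.argmin(s.to_numpy()))

# The two agree once the position is mapped back through the index.
print(label, position, s.index[position])
```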

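Value counts (histogramming) and mode

The value_counts() Series method and the top-level function of the same name compute a histogram of a 1D array of values. The top-level function can also be applied to regular arrays:

In [117]: data = np.random.randint(0, 7, size=50)

In [118]: data
Out[118]:
array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
       2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
       6, 2, 6, 1, 5, 4])

In [119]: s = pd.Series(data)

In [120]: s.value_counts()
Out[120]:
6    10
2    10
4     9
5     8
3     8
0     3
1     2
dtype: int64

In [121]: pd.value_counts(data)
Out[121]:
6    10
2    10
4     9
5     8
3     8
0     3
1     2
dtype: int64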
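value_counts() can also report relative frequencies instead of raw counts via its normalize parameter. A small sketch on made-up survey answers:

```python
import pandas as pd

answers = pd.Series(["yes", "no", "yes", "yes"])

# Raw counts, most frequent value first.
counts = answers.value_counts()

# normalize=True divides by the number of non-missing values.
shares = answers.value_counts(normalize=True)

print(counts)
print(shares)
```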

Similarly, you can compute the mode, that is, the most frequently occurring value or values, of a Series or DataFrame:

In [122]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [123]: s5.mode()
Out[123]:
0    3
1    7
dtype: int64

In [124]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
   .....:                     "B": np.random.randint(-10, 15, size=50)})
   .....:

In [125]: df5.mode()
Out[125]:
     A   B
0  1.0  -9
1  NaN  10
2  NaN  13
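The NaN entries in a DataFrame mode() result are padding: each column can have a different number of modes, and shorter columns are filled with NaN. A sketch with made-up data where column "A" has one mode and column "B" has two:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [4, 4, 5, 5]})

# One row per mode; columns with fewer modes are padded with NaN.
modes = df.mode()

print(modes)
```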

Discretization and quantiling

Continuous values can be discretized with cut() (bins based on values) and qcut() (bins based on sample quantiles):

In [126]: arr = np.random.randn(20)

In [127]: factor = pd.cut(arr, 4)

In [128]: factor
Out[128]:
[(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
Length: 20
Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] <
                                    (1.179, 1.893]]

In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [130]: factor
Out[130]:
[(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
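The intervals above are closed on the right, which matters for values that fall exactly on a bin edge. The sketch below, on made-up ages, shows how the labels and right parameters of cut() change the result (bin edges and label names are assumptions for illustration):

```python
import pandas as pd

ages = [3, 18, 40, 65]

# Default right=True: bins are (0, 18] and (18, 65], so 18 lands in the first.
groups = pd.cut(ages, bins=[0, 18, 65], labels=["minor", "adult"])

# right=False: bins are [0, 18), [18, 65), [65, 120), so 18 lands in the second.
left_closed = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    right=False,
    labels=["minor", "adult", "senior"],
)

print(list(groups))
print(list(left_closed))
```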

qcut() computes sample quantiles. For example, the following code splits normally distributed data into quartiles of roughly equal size:

In [131]: arr = np.random.randn(30)

In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [133]: factor
Out[133]:
[(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
Length: 30
Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] <
                                    (1.184, 2.346]]

In [134]: pd.value_counts(factor)
Out[134]:
(1.184, 2.346]      8
(-2.278, -0.301]    8
(0.569, 1.184]      7
(-0.301, 0.569]     7
dtype: int64
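Instead of listing the quantile edges, qcut() also accepts an integer number of quantiles. A sketch on made-up data (the integers 0 to 99, with hypothetical labels "q1" to "q4") where every bin ends up the same size:

```python
import numpy as np
import pandas as pd

values = np.arange(100)

# q=4 is shorthand for the edges [0, .25, .5, .75, 1].
quartiles = pd.qcut(values, 4, labels=["q1", "q2", "q3", "q4"])

# Count the members of each bin, in category order.
sizes = pd.Series(quartiles).value_counts().sort_index()

print(sizes)
```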

Infinite values can also be passed when defining the bins:

In [135]: arr = np.random.randn(20)

In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [137]: factor
Out[137]:
[(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
