10 Minutes to pandas

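If you are new to data processing and analysis, ten minutes spent getting to know pandas is well worth it. Let's get started.

The first step is to import pandas along with its usual companions, NumPy and Matplotlib:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import matplotlib.pyplot as plt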

Object creation

See the Data Structure Intro section for details.

Create a Series by passing a list of values, letting pandas create a default integer index:

In [4]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [5]: s
Out[5]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
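The float64 dtype in the output above comes from the NaN entry: NaN is a floating-point value, so the whole Series is stored as floats. A minimal sketch of that behaviour, assuming a recent pandas release (the names index_labels, missing_count and int_series are illustrative and not from the tutorial):

```python
import numpy as np
import pandas as pd

# Same Series as in the tutorial: one missing value among integers.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# The default index is a run of integers starting at 0.
index_labels = list(s.index)

# isna() flags the missing entries; summing the flags counts them.
missing_count = int(s.isna().sum())

# Without NaN, the same kind of list is stored with an integer dtype.
int_series = pd.Series([1, 3, 5])
```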

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [6]: dates = pd.date_range('20130101', periods=6)

In [7]: dates
Out[7]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [9]: df
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
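Because np.random.randn draws fresh random numbers on every run, your values will differ from the ones printed above. If you want a frame that comes out the same each time, one option is a seeded generator. This is a sketch under that assumption; the seed 0 and the helper name make_frame are arbitrary choices, not part of the tutorial:

```python
import numpy as np
import pandas as pd


def make_frame(seed):
    """Build a 6x4 frame of standard-normal values from a seeded generator."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("20130101", periods=6)
    return pd.DataFrame(rng.standard_normal((6, 4)),
                        index=dates, columns=list("ABCD"))


df = make_frame(0)
df_again = make_frame(0)  # same seed, so the values are identical
```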

Create a DataFrame by passing a dict of objects, each of which can be converted to a Series-like structure:

In [10]: df2 = pd.DataFrame({'A': 1.,
   ....:                     'B': pd.Timestamp('20130102'),
   ....:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ....:                     'D': np.array([3] * 4, dtype='int32'),
   ....:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ....:                     'F': 'foo'})
   ....:

In [11]: df2
Out[11]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have different dtypes:

In [12]: df2.dtypes
Out[12]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
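In the dict constructor, scalar entries such as 'A' and 'F' are repeated to match the length of the array-like entries. The sketch below checks that, and also shows that a dict made only of scalars is rejected unless an index is supplied. It assumes a recent pandas release; exact dtypes for the timestamp and string columns can vary between versions, so they are not checked by name here.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                   # scalar, repeated on every row
    "B": pd.Timestamp("20130102"),              # scalar timestamp, repeated
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",                                 # scalar string, repeated
})

row_count = len(df2)               # length comes from the array-like columns
distinct_a = df2["A"].nunique()    # a broadcast scalar has a single distinct value

# With scalars only there is nothing to take the length from.
try:
    pd.DataFrame({"A": 1.0, "F": "foo"})
    scalars_only_failed = False
except ValueError:
    scalars_only_failed = True
```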

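If you are using IPython, Tab completion offers the column names along with the other public attributes:

In [13]: df2.<TAB>
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.clip_lower
df2.align              df2.clip_upper
df2.all                df2.columns
df2.any                df2.combine
df2.append             df2.combine_first
df2.apply              df2.compound
df2.applymap           df2.consolidate
df2.as_blocks          df2.convert_objects
df2.asfreq             df2.copy
df2.as_matrix          df2.corr
df2.astype             df2.corrwith
df2.at                 df2.count
df2.at_time            df2.cov
df2.axes               df2.cummax
df2.B                  df2.cummin
df2.between_time       df2.cumprod
df2.bfill              df2.cumsum
df2.blocks             df2.D

As you can see, columns A, B, C, and D are all completed. E is there as well, but the list has been truncated for brevity. The listing reflects the older pandas release this tutorial was written against; some of these attributes (for example as_matrix and convert_objects) no longer exist in current releases.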

Viewing data

See the Basics section for details.

View the top and bottom rows of the frame:

In [14]: df.head()
Out[14]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In [15]: df.tail(3)
Out[15]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Display the index, the columns, and the underlying NumPy data:

In [16]: df.index
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [18]: df.values
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
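In pandas 0.24 and later, DataFrame.to_numpy() does the same job as .values and is the spelling the pandas documentation now recommends. A NumPy array has a single dtype, so a frame with mixed column types is converted to an object array. A small sketch with made-up data (the column names x, y and label are illustrative):

```python
import numpy as np
import pandas as pd

# All-float frame: the array keeps the float64 dtype.
numeric = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
arr = numeric.to_numpy()

# Mixed float and string columns: the only common dtype is object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})
mixed_arr = mixed.to_numpy()
```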

describe() gives a quick statistical summary of the data:

In [19]: df.describe()
Out[19]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

Transpose the data:

In [20]: df.T
Out[20]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

Sort along an axis (here, the column labels in descending order):

In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690

Sort by values:

In [22]: df.sort_values(by='B')
Out[22]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
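Both sorting methods return a new object and leave the original frame untouched. The sketch below shows this on a tiny hand-made frame, so the expected order is easy to verify by eye (the data and the row labels x, y, z are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -1.0, 0.2]},
                  index=list("xyz"))

# Rows ordered by the values in column B, smallest first.
by_b = df.sort_values(by="B")

# Column labels ordered in reverse alphabetical order.
cols_desc = df.sort_index(axis=1, ascending=False)
```

Here by_b lists the rows in the order y, z, x, while df itself still reads x, y, z.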

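Selection

Note: While standard Python / NumPy expressions for selecting and setting are intuitive and handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc, .iloc and .ix. (.ix was deprecated in pandas 0.20 and removed in 1.0, so current code should use .loc or .iloc instead.)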

Getting

Select a single column, which yields a Series, equivalent to df.A:

In [23]: df['A']
Out[23]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64

Select via [], which slices the rows:

In [24]: df[0:3]
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [25]: df['20130102':'20130104']
Out[25]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
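The two slices above follow different endpoint rules: the integer slice is positional and excludes its end, as in plain Python, while the date-string slice is label-based and includes its end. A minimal sketch of the difference, using a one-column frame with made-up values:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": range(6)}, index=dates)

# Positional slice: rows 0, 1, 2 (position 3 is excluded).
by_position = df[0:3]

# Label slice: 2013-01-02 through 2013-01-04, both ends included.
by_label = df["20130102":"20130104"]
```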

Selection by label

See Selection by Label.

Get a cross section using a label:

In [26]: df.loc[dates[0]]
Out[26]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64

Select on multiple axes by label:

In [27]: df.loc[:, ['A', 'B']]
Out[27]:
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648

Label slicing, where both endpoints are included:

In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

Reduce the dimensions of the returned object (a single row label yields a Series):

In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64

Get a scalar value:

In [30]: df.loc[dates[0], 'A']
Out[30]: 0.46911229990718628

For fast access to a scalar (equivalent to the previous method):

In [31]: df.at[dates[0], 'A']
Out[31]: 0.46911229990718628
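The label-based accessors can be compared side by side on a frame whose values are easy to predict. This sketch fills the frame with 0.0 to 23.0 row by row instead of random numbers (an assumption made purely for readability), then checks that .loc and .at agree on a scalar:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4),
                  index=dates, columns=list("ABCD"))

# Label slice on rows (both ends included) plus a list of column labels.
window = df.loc["20130102":"20130104", ["A", "B"]]

# A single row label reduces the result from a DataFrame to a Series.
row = df.loc["20130102", ["A", "B"]]

# .loc and .at return the same scalar; .at only accepts single labels.
scalar_loc = df.loc[dates[0], "A"]
scalar_at = df.at[dates[0], "A"]
```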

Selection by position

See Selection by Position.

Select by passing an integer position:

In [32]: df.iloc[3]
Out[32]:
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64

Select by integer slices, which behave as in NumPy and Python:

In [33]: df.iloc[3:5, 0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

Select by lists of integer positions, again in the NumPy / Python style:

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

Slice rows explicitly:

In [35]: df.iloc[1:3, :]
Out[35]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

Slice columns explicitly:

In [36]: df.iloc[:, 1:3]
Out[36]:
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427

Get a single value explicitly:

In [37]: df.iloc[1, 1]
Out[37]: -0.173********330858

For fast access to a scalar (equivalent to the previous method):

In [38]: df.iat[1, 1]
Out[38]: -0.173********330858
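The positional accessors can be checked the same way on predictable data. In this sketch the frame holds the integers 0 to 23 row by row (an illustrative choice, not the tutorial's random data), so the element at row 1, column 1 is 5:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

# Integer slices exclude the end position: rows 3-4, columns 0-1.
block = df.iloc[3:5, 0:2]

# Lists of positions pick arbitrary rows and columns.
picked = df.iloc[[1, 2, 4], [0, 2]]

# .iloc and .iat return the same scalar; .iat only accepts single positions.
scalar_iloc = df.iloc[1, 1]
scalar_iat = df.iat[1, 1]
```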

Boolean indexing

Use a single column's values to select rows:

In [39]: df[df.A > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

Select the values of a DataFrame that meet a boolean condition; entries that fail the condition become NaN:

In [40]: df[df > 0]
Out[40]:
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
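Two details of boolean indexing are easy to miss: conditions are combined with & and | (each wrapped in parentheses) and not with Python's and/or, and indexing with a whole-frame mask gives the same result as DataFrame.where. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, 6.0]})

# Row filter on one column.
rows = df[df["A"] > 0]

# Two conditions: parentheses are required around each comparison.
both = df[(df["A"] > 0) & (df["B"] > 0)]

# Element-wise mask: same shape as df, failing entries become NaN.
masked = df[df > 0]
same_as_where = masked.equals(df.where(df > 0))
```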

Use the isin() method for filtering:

In [41]: df2 = df.copy()

In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
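isin() returns a boolean Series, so it can be stored and inverted with ~ to keep the rows that do not match. A brief sketch, reusing the tutorial's 'E' labels on a simplified frame (the integer column A is illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "E": ["one", "one", "two", "three", "four", "three"],
})

# True where E is one of the listed values.
mask = df2["E"].isin(["two", "four"])

kept = df2[mask]       # rows whose E is 'two' or 'four'
dropped = df2[~mask]   # every other row
```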

Setting

Setting a new column automatically aligns the data by index:

In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))

In [46]: s1
Out[46]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [47]: df['F'] = s1

Set a value by label:

In [48]: df.at[dates[0], 'A'] = 0

Set a value by position:

In [49]: df.iat[0, 1] = 0

Set a whole column by assigning a NumPy array:

In [50]: df.loc[:, 'D'] = np.array([5] * len(df))

In [51]: df
Out[51]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
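Column F in the output above shows what index alignment means in practice: s1 starts one day later than df, so 2013-01-01 has no matching value and receives NaN, and the 2013-01-07 entry of s1 has no matching row and is dropped. A minimal sketch of just that step, with a placeholder column A of zeros standing in for the tutorial's data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": np.zeros(6)}, index=dates)

# s1 covers 2013-01-02 .. 2013-01-07, shifted one day from df's index.
s1 = pd.Series([1, 2, 3, 4, 5, 6],
               index=pd.date_range("20130102", periods=6))

# Assignment matches values by index label, not by position.
df["F"] = s1
```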

A where operation with setting (every positive value is replaced by its negation):

In [52]: df2 = df.copy()

In [53]: df2[df2 > 0] = -df2

In [54]: df2
Out[54]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059 -5  NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
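The masked assignment above only touches entries where the condition is True; zeros, negatives and NaN are left alone. The sketch below reproduces it on a small made-up frame and compares it with DataFrame.mask, which expresses the same replacement without modifying anything in place:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 0.0], "B": [-3.0, 4.0, np.nan]})

# In-place version, as in the tutorial: work on a copy so df is preserved.
df2 = df.copy()
df2[df2 > 0] = -df2

# Functional version: replace entries where the condition holds.
via_mask = df.mask(df > 0, -df)
```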

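Missing data

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations. See the Missing Data section for details.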
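"Excluded from computations" means that reductions such as mean() and count() skip NaN unless told otherwise. A short sketch with illustrative values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Default: NaN is skipped, so the mean is taken over 1.0 and 3.0.
mean_default = s.mean()

# skipna=False propagates the missing value instead.
mean_strict = s.mean(skipna=False)

# count() reports only the non-missing entries.
present = int(s.count())
```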

Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.

In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.c
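The reindexing behaviour described above can be sketched on a small frame. This is not a completion of the truncated statement in the source: the frame contents and the extra column label 'E' are hypothetical, chosen only to show that labels absent from the original are filled with NaN and that the original frame is left unchanged.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": range(6), "B": range(6)}, index=dates)

# Keep the first four dates and request a column ('E', hypothetical)
# that df does not have; reindex returns a new frame.
df1 = df.reindex(index=dates[0:4], columns=["A", "B", "E"])
```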
