Format: DOCX, Pages: 34, Size: 120.49 KB

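This document (10分钟pandas教程.docx) was uploaded by site member b****6. 冰豆网 only provides storage space and protects the presentation of user-uploaded content; it does not modify or edit the uploaded content itself. If this document infringes your copyright or privacy, notify 冰豆网 immediately (by email to service@bdocx.com or by contacting customer service via QQ) and it will be removed promptly.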

10分钟pandas教程.docx (10 Minutes to pandas tutorial)

10 Minutes to pandas

If you are new to data processing and analysis, ten minutes spent getting familiar with pandas is time well spent. Let's get started.

The first step is to import pandas together with its usual companions:

    In [1]: import pandas as pd
    In [2]: import numpy as np
    In [3]: import matplotlib.pyplot as plt

Object Creation

See the Data Structure Intro section for details.

Create a Series by passing a list of values, letting pandas create a default integer index:

    In [4]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
    In [5]: s
    Out[5]:
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

    In [6]: dates = pd.date_range('20130101', periods=6)
    In [7]: dates
    Out[7]:
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    In [8]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    In [9]: df
    Out[9]:
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Create a DataFrame by passing a dict of objects, each of which can be converted to something series-like:

    In [10]: df2 = pd.DataFrame({'A': 1.,
       ....:                     'B': pd.Timestamp('20130102'),
       ....:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
       ....:                     'D': np.array([3] * 4, dtype='int32'),
       ....:                     'E': pd.Categorical(["test", "train", "test", "train"]),
       ....:                     'F': 'foo'})
       ....:
    In [11]: df2
    Out[11]:
         A          B    C  D      E    F
    0  1.0 2013-01-02  1.0  3   test  foo
    1  1.0 2013-01-02  1.0  3  train  foo
    2  1.0 2013-01-02  1.0  3   test  foo
    3  1.0 2013-01-02  1.0  3  train  foo

The resulting columns have different dtypes:

    In [12]: df2.dtypes
    Out[12]:
    A           float64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object

If you are using IPython, Tab completion lists all column names, along with the other public attributes:

    In [13]: df2.    # press Tab here
    df2.A                  df2.bool
    df2.abs                df2.boxplot
    df2.add                df2.C
    df2.add_prefix         df2.clip
    df2.add_suffix         df2.clip_lower
    df2.align              df2.clip_upper
    df2.all                df2.columns
    df2.any                df2.combine
    df2.append             df2.combine_first
    df2.apply              df2.compound
    df2.applymap           df2.consolidate
    df2.as_blocks          df2.convert_objects
    df2.asfreq             df2.copy
    df2.as_matrix          df2.corr
    df2.astype             df2.corrwith
    df2.at                 df2.count
    df2.at_time            df2.cov
    df2.axes               df2.cummax
    df2.B                  df2.cummin
    df2.between_time       df2.cumprod
    df2.bfill              df2.cumsum
    df2.blocks             df2.D

As you can see, the columns A, B, C, and D are all completed. E is available as well; the rest of the listing has been truncated for brevity.
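
The dict-based constructor above relies on broadcasting: scalar entries are repeated to match the length of the array-like entries. The following is a minimal sketch of that behavior, assuming a recent pandas release. The seeded generator `rng` is an addition for reproducibility and is not part of the tutorial, so the random values will differ from the tables shown.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption made so the sketch is reproducible.
rng = np.random.default_rng(0)

# Series from a list; pandas supplies the default integer index 0..5.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# DataFrame from a NumPy array with a datetime index and labeled columns.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# DataFrame from a dict: the scalars in A, B and F are broadcast to the
# four rows implied by the Series, array and Categorical entries.
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

print(s.dtype)
print(df.shape)
print(df2.dtypes)
```

The exact dtype reported for columns B and F may vary between pandas versions, so the printed dtypes need not match the listing above character for character.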

Viewing Data

See the Basics section for details.

View the top and bottom rows of the frame:

    In [14]: df.head()
    Out[14]:
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    In [15]: df.tail(3)
    Out[15]:
                       A         B         C         D
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Display the index, the column labels, and the underlying NumPy data:

    In [16]: df.index
    Out[16]:
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    In [17]: df.columns
    Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')
    In [18]: df.values
    Out[18]:
    array([[ 0.4691, -0.2829, -1.5091, -1.1356],
           [ 1.2121, -0.1732,  0.1192, -1.0442],
           [-0.8618, -2.1046, -0.4949,  1.0718],
           [ 0.7216, -0.7068, -1.0396,  0.2719],
           [-0.425 ,  0.567 ,  0.2762, -1.0874],
           [-0.6737,  0.1136, -1.4784,  0.525 ]])

Get a quick statistical summary of the data:

    In [19]: df.describe()
    Out[19]:
                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.073711 -0.431125 -0.687758 -0.233103
    std    0.843157  0.922818  0.779887  0.973118
    min   -0.861849 -2.104569 -1.509059 -1.135632
    25%   -0.611510 -0.600794 -1.368714 -1.076610
    50%    0.022070 -0.228039 -0.767252 -0.386188
    75%    0.658444  0.041933 -0.034326  0.461706
    max    1.212112  0.567020  0.276232  1.071804
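
The inspection calls above can be tried on a frame with predictable contents. This is a minimal sketch, assuming a recent pandas release; the frame is filled with np.arange values (an assumption, chosen so the summary statistics are easy to verify by hand) rather than the tutorial's random data. It uses DataFrame.to_numpy(), which newer pandas documentation recommends over the .values attribute shown above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic data: row i holds 4*i, 4*i+1, 4*i+2, 4*i+3.
df = pd.DataFrame(np.arange(24, dtype="float64").reshape(6, 4),
                  index=dates, columns=list("ABCD"))

top = df.head(2)         # first two rows
bottom = df.tail(3)      # last three rows
arr = df.to_numpy()      # underlying data as a NumPy array
summary = df.describe()  # count, mean, std, min, quartiles, max per column

print(top)
print(bottom)
print(arr.shape)
print(summary.loc["mean"])
```

Column A holds 0, 4, 8, 12, 16, 20, so its mean in the summary is 10.0.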

Transpose the data:

    In [20]: df.T
    Out[20]:
       2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
    A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
    B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
    C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
    D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

Sort along an axis:

    In [21]: df.sort_index(axis=1, ascending=False)
    Out[21]:
                       D         C         B         A
    2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
    2013-01-02 -1.044236  0.119209 -0.173215  1.212112
    2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
    2013-01-04  0.271860 -1.039575 -0.706771  0.721555
    2013-01-05 -1.087401  0.276232  0.567020 -0.424972
    2013-01-06  0.524988 -1.478427  0.113648 -0.673690

Sort by values:

    In [22]: df.sort_values(by='B')
    Out[22]:
                       A         B         C         D
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
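
The difference between sorting by labels and sorting by values is easier to see on a tiny frame. The sketch below is illustrative only, assuming a recent pandas release; the three-row frame and its index labels x, y, z are made up for this example. Both sort methods and .T return new objects and leave the original frame unchanged.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -2.0, 0.1]}, index=list("xyz"))

# Sort the column labels in descending order (axis=1 means columns).
by_columns = df.sort_index(axis=1, ascending=False)

# Sort the rows by the values stored in column B.
by_b = df.sort_values(by="B")

# Swap rows and columns.
transposed = df.T

print(by_columns.columns.tolist())
print(by_b.index.tolist())
print(transposed.shape)
```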

Selection

Note: While standard Python/NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc, and .iloc. (The source also lists .ix, which was deprecated in pandas 0.20 and removed in pandas 1.0.)

Getting

Select a single column, which yields a Series; this is equivalent to df.A:

    In [23]: df['A']
    Out[23]:
    2013-01-01    0.469112
    2013-01-02    1.212112
    2013-01-03   -0.861849
    2013-01-04    0.721555
    2013-01-05   -0.424972
    2013-01-06   -0.673690
    Freq: D, Name: A, dtype: float64

Select via [], which slices the rows:

    In [24]: df[0:3]
    Out[24]:
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    In [25]: df['20130102':'20130104']
    Out[25]:
                       A         B         C         D
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
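
The [] operator behaves differently depending on what it is given: a column name returns a Series, an integer slice selects rows by position (end excluded), and a slice of date strings selects rows by label (both ends included). The following sketch illustrates this under the assumption of a recent pandas release and a deterministic np.arange frame in place of the tutorial's random data.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24, dtype="float64").reshape(6, 4),
                  index=dates, columns=list("ABCD"))

col = df["A"]                         # single column as a Series
first_three = df[0:3]                 # positions 0, 1, 2 (end excluded)
by_label = df["20130102":"20130104"]  # label slice (both ends included)

print(col.name)
print(first_three.index[-1])
print(by_label.index[-1])
```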

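Selection by Label

See Selection by Label for more.

Get a cross section using a label:

    In [26]: df.loc[dates[0]]
    Out[26]:
    A    0.469112
    B   -0.282863
    C   -1.509059
    D   -1.135632
    Name: 2013-01-01 00:00:00, dtype: float64

Select on multiple axes by label:

    In [27]: df.loc[:, ['A', 'B']]
    Out[27]:
                       A         B
    2013-01-01  0.469112 -0.282863
    2013-01-02  1.212112 -0.173215
    2013-01-03 -0.861849 -2.104569
    2013-01-04  0.721555 -0.706771
    2013-01-05 -0.424972  0.567020
    2013-01-06 -0.673690  0.113648

Label slicing includes both endpoints:

    In [28]: df.loc['20130102':'20130104', ['A', 'B']]
    Out[28]:
                       A         B
    2013-01-02  1.212112 -0.173215
    2013-01-03 -0.861849 -2.104569
    2013-01-04  0.721555 -0.706771

Selecting a single row label reduces the dimensions of the returned object:

    In [29]: df.loc['20130102', ['A', 'B']]
    Out[29]:
    A    1.212112
    B   -0.173215
    Name: 2013-01-02 00:00:00, dtype: float64

Get a scalar value:

    In [30]: df.loc[dates[0], 'A']
    Out[30]: 0.46911229990718628

For faster access to a scalar (equivalent to the previous method):

    In [31]: df.at[dates[0], 'A']
    Out[31]: 0.46911229990718628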
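
The label-based accessors above can be checked against a frame with known contents. This is a minimal sketch, assuming a recent pandas release and a deterministic np.arange frame rather than the tutorial's random data. It shows that a .loc slice includes both endpoints and that .at returns the same scalar as .loc.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24, dtype="float64").reshape(6, 4),
                  index=dates, columns=list("ABCD"))

row = df.loc[dates[0]]                               # one row as a Series
sub = df.loc["20130102":"20130104", ["A", "B"]]      # 3 rows, 2 columns
scalar_loc = df.loc[dates[0], "A"]                   # scalar via .loc
scalar_at = df.at[dates[0], "A"]                     # same scalar via .at

print(type(row).__name__)
print(sub.shape)
print(scalar_loc, scalar_at)
```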

Selection by Position

See Selection by Position for more.

Select via the position of the passed integer:

    In [32]: df.iloc[3]
    Out[32]:
    A    0.721555
    B   -0.706771
    C   -1.039575
    D    0.271860
    Name: 2013-01-04 00:00:00, dtype: float64

Select by integer slices, which act as they do in NumPy and Python:

    In [33]: df.iloc[3:5, 0:2]
    Out[33]:
                       A         B
    2013-01-04  0.721555 -0.706771
    2013-01-05 -0.424972  0.567020

Select by lists of integer positions, again in NumPy/Python style:

    In [34]: df.iloc[[1, 2, 4], [0, 2]]
    Out[34]:
                       A         C
    2013-01-02  1.212112  0.119209
    2013-01-03 -0.861849 -0.494929
    2013-01-05 -0.424972  0.276232

Slice rows explicitly:

    In [35]: df.iloc[1:3, :]
    Out[35]:
                       A         B         C         D
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

Slice columns explicitly:

    In [36]: df.iloc[:, 1:3]
    Out[36]:
                       B         C
    2013-01-01 -0.282863 -1.509059
    2013-01-02 -0.173215  0.119209
    2013-01-03 -2.104569 -0.494929
    2013-01-04 -0.706771 -1.039575
    2013-01-05  0.567020  0.276232
    2013-01-06  0.113648 -1.478427

Get a single value explicitly:

    In [37]: df.iloc[1, 1]
    Out[37]: -0.17321464905330858

For faster access to a scalar (equivalent to the previous method):

    In [38]: df.iat[1, 1]
    Out[38]: -0.17321464905330858
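
Position-based slices exclude the end position, unlike the label slices used with .loc. The sketch below illustrates the .iloc and .iat calls from this section, assuming a recent pandas release and a deterministic np.arange frame in place of the tutorial's random data, so that the cell at row 1, column 1 is known to be 5.0.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Row i holds 4*i, 4*i+1, 4*i+2, 4*i+3, so cell (1, 1) is 5.0.
df = pd.DataFrame(np.arange(24, dtype="float64").reshape(6, 4),
                  index=dates, columns=list("ABCD"))

block = df.iloc[3:5, 0:2]              # rows 3-4, columns 0-1 (ends excluded)
picked = df.iloc[[1, 2, 4], [0, 2]]    # explicit lists of positions
value_iloc = df.iloc[1, 1]             # scalar via .iloc
value_iat = df.iat[1, 1]               # same scalar via .iat

print(block.shape)
print(picked.columns.tolist())
print(value_iloc, value_iat)
```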

Boolean Indexing

Use the (boolean) values of a single column to select rows:

    In [39]: df[df.A > 0]
    Out[39]:
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860

Select the values of a DataFrame that satisfy a boolean condition; the others become NaN:

    In [40]: df[df > 0]
    Out[40]:
                       A         B         C         D
    2013-01-01  0.469112       NaN       NaN       NaN
    2013-01-02  1.212112       NaN  0.119209       NaN
    2013-01-03       NaN       NaN       NaN  1.071804
    2013-01-04  0.721555       NaN       NaN  0.271860
    2013-01-05       NaN  0.567020  0.276232       NaN
    2013-01-06       NaN  0.113648       NaN  0.524988

Use the isin() method for filtering:

    In [41]: df2 = df.copy()
    In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    In [43]: df2
    Out[43]:
                       A         B         C         D      E
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three
    In [44]: df2[df2['E'].isin(['two', 'four'])]
    Out[44]:
                       A         B         C         D     E
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
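
The three filtering patterns above differ in what they return: a boolean Series keeps or drops whole rows, a boolean DataFrame keeps the shape and blanks out non-matching cells, and isin() builds a row mask from a set of allowed values. The following is an illustrative sketch on a made-up three-row frame, assuming a recent pandas release.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, -1.0, 2.0],
    "B": [-3.0, 4.0, 5.0],
    "E": ["one", "two", "four"],
})

# Boolean Series mask: keeps only the rows where A is positive.
rows = df[df["A"] > 0]

# Boolean DataFrame mask: same shape, non-matching cells become NaN.
numeric = df[["A", "B"]]
masked = numeric[numeric > 0]

# isin(): keeps rows whose E value is one of the listed labels.
picked = df[df["E"].isin(["two", "four"])]

print(rows.index.tolist())
print(int(masked.isna().sum().sum()))
print(picked.index.tolist())
```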

Setting

Setting a new column automatically aligns the data by index:

    In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
    In [46]: s1
    Out[46]:
    2013-01-02    1
    2013-01-03    2
    2013-01-04    3
    2013-01-05    4
    2013-01-06    5
    2013-01-07    6
    Freq: D, dtype: int64
    In [47]: df['F'] = s1

Set a value by label:

    In [48]: df.at[dates[0], 'A'] = 0

Set a value by position:

    In [49]: df.iat[0, 1] = 0

Set a column by assigning a NumPy array:

    In [50]: df.loc[:, 'D'] = np.array([5] * len(df))

The result of the preceding set operations:

    In [51]: df
    Out[51]:
                       A         B         C  D    F
    2013-01-01  0.000000  0.000000 -1.509059  5  NaN
    2013-01-02  1.212112 -0.173215  0.119209  5  1.0
    2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
    2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
    2013-01-05 -0.424972  0.567020  0.276232  5  4.0
    2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
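
Index alignment explains the NaN in column F: s1 starts one day later than df, so the first row of df has no matching label, and the 2013-01-07 entry of s1 has no matching row and is dropped. The sketch below reproduces the set operations under the assumption of a recent pandas release and a deterministic np.arange frame in place of the tutorial's random data. Depending on the pandas version, column D may remain float64 after the assignment instead of showing as integers.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24, dtype="float64").reshape(6, 4),
                  index=dates, columns=list("ABCD"))

# s1 is indexed from 2013-01-02 to 2013-01-07, one day later than df.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

df["F"] = s1                                # aligned on the index labels
df.at[dates[0], "A"] = 0                    # set by label
df.iat[0, 1] = 0                            # set by position
df.loc[:, "D"] = np.array([5] * len(df))    # set a whole column

print(df)
```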

A where operation with setting:

    In [52]: df2 = df.copy()
    In [53]: df2[df2 > 0] = -df2
    In [54]: df2
    Out[54]:
                       A         B         C  D    F
    2013-01-01  0.000000  0.000000 -1.509059 -5  NaN
    2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
    2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
    2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
    2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
    2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0

Missing Data

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations. See the Missing Data section for details.

Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.

    In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.c
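
The where-style assignment and the reindexing step described above can be sketched as follows, assuming a recent pandas release. The frame contents and the extra column name "X" are hypothetical and chosen for illustration only; they do not come from the tutorial, whose reindex call is cut off in this copy.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic data from -12 to 11, so both signs are present.
df = pd.DataFrame(np.arange(-12, 12, dtype="float64").reshape(6, 4),
                  index=dates, columns=list("ABCD"))

# Where-style setting: every positive cell is replaced by its negation.
df2 = df.copy()
df2[df2 > 0] = -df2

# Reindex: keep the first four dates and add a hypothetical column "X".
# The new column has no data yet, so it is filled with NaN.
df1 = df.reindex(index=dates[0:4], columns=list("ABCD") + ["X"])
df1.loc[dates[0]:dates[1], "X"] = 1.0   # label slice, both ends included

# Missing values are skipped by default in computations such as mean().
mean_x = df1["X"].mean()

print(df2)
print(df1)
print(mean_x)
```

Because reindex returns a copy, the original df still has six rows and no "X" column after these steps.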
