10-Minute pandas Tutorial

For newcomers to data processing and analysis, ten minutes spent getting familiar with pandas is well worth it. Let's get started.

The first step is to import pandas and its usual companions:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import matplotlib.pyplot as plt

Object Creation

See the Data Structure Intro section for details.

Create a Series by passing a list of values, letting pandas create a default integer index:

In [4]: s = pd.Series([1,3,5,np.nan,6,8])

In [5]: s
Out[5]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [6]: dates = pd.date_range('20130101', periods=6)

In [7]: dates
Out[7]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [9]: df
Out[9]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Create a DataFrame by passing a dict of objects that can each be converted to something series-like:

In [10]: df2 = pd.DataFrame({ 'A' : 1.,
   ....:                      'B' : pd.Timestamp('20130102'),
   ....:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
   ....:                      'D' : np.array([3] * 4,dtype='int32'),
   ....:                      'E' : pd.Categorical(["test","train","test","train"]),
   ....:                      'F' : 'foo' })
   ....: 

In [11]: df2
Out[11]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The resulting columns have these dtypes:

In [12]: df2.dtypes
Out[12]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you are using IPython, tab completion lists all the column names (along with the other public attributes):

In [13]: df2.<TAB>
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.clip_lower
df2.align              df2.clip_upper
df2.all                df2.columns
df2.any                df2.combine
df2.append             df2.combine_first
df2.apply              df2.compound
df2.applymap           df2.consolidate
df2.as_blocks          df2.convert_objects
df2.asfreq             df2.copy
df2.as_matrix          df2.corr
df2.astype             df2.corrwith
df2.at                 df2.count
df2.at_time            df2.cov
df2.axes               df2.cummax
df2.B                  df2.cummin
df2.between_time       df2.cumprod
df2.bfill              df2.cumsum
df2.blocks             df2.D

As you can see, the columns A, B, C and D are all tab completed. E is there as well; the rest of the list has been truncated for brevity.

Viewing Data

See the Basics section for details.

View the top and bottom rows of the frame:

In [14]: df.head()
Out[14]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In [15]: df.tail(3)
Out[15]: 
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Display the index, the column labels, and the underlying numpy data:

In [16]: df.index
Out[16]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [18]: df.values
Out[18]: 
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])

Get a quick statistical summary of the data:

In [19]: df.describe()
Out[19]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

Transpose the data:

In [20]: df.T
Out[20]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

Sort by an axis:

In [21]: df.sort_index(axis=1, ascending=False)
Out[21]: 
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690

Sort by values:

In [22]: df.sort_values(by='B')
Out[22]: 
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

Selection

Note: Standard Python / Numpy expressions for selecting and setting are intuitive and handy for interactive work, but for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc. (The original text also lists .ix, which was deprecated in pandas 0.20 and removed in pandas 1.0.)

Getting

Select a single column, which returns a Series; this is equivalent to df.A:

In [23]: df['A']
Out[23]: 
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64

Select via [], which slices the rows:

In [24]: df[0:3]
Out[24]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [25]: df['20130102':'20130104']
Out[25]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

Selection by Label

See Selection by Label for details.

Get a cross section using a label:

In [26]: df.loc[dates[0]]
Out[26]: 
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64

Select on multiple axes by label:

In [27]: df.loc[:,['A','B']]
Out[27]: 
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648

Label slicing includes both endpoints:

In [28]: df.loc['20130102':'20130104',['A','B']]
Out[28]: 
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

Reduce the dimensions of the returned object:

In [29]: df.loc['20130102',['A','B']]
Out[29]: 
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64

Get a scalar value:

In [30]: df.loc[dates[0],'A']
Out[30]: 0.46911229990718628

Get fast access to a scalar (equivalent to the previous method):

In [31]: df.at[dates[0],'A']
Out[31]: 0.46911229990718628

Selection by Position

See Selection by Position for details.

Select by passing an integer position:

In [32]: df.iloc[3]
Out[32]: 
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64

Select by integer slices, which behave like numpy and python slices:

In [33]: df.iloc[3:5,0:2]
Out[33]: 
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

Select by lists of integer positions, again in numpy / python style:

In [34]: df.iloc[[1,2,4],[0,2]]
Out[34]: 
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

Slice rows explicitly:

In [35]: df.iloc[1:3,:]
Out[35]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

Slice columns explicitly:

In [36]: df.iloc[:,1:3]
Out[36]: 
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427

Get a single value explicitly:

In [37]: df.iloc[1,1]
Out[37]: -0.17321464905330858

Get fast access to a scalar (equivalent to the previous method):

In [38]: df.iat[1,1]
Out[38]: -0.17321464905330858

Boolean Indexing

Use the values of a single column to select rows:

In [39]: df[df.A > 0]
Out[39]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

Select the values of a DataFrame that meet a boolean condition:

In [40]: df[df > 0]
Out[40]: 
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988

Filter with the isin() method:

In [41]: df2 = df.copy()

In [42]: df2['E'] = ['one', 'one','two','three','four','three']

In [43]: df2
Out[43]: 
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2['E'].isin(['two','four'])]
Out[44]: 
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

Setting

Setting a new column automatically aligns the data by index:

In [45]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))

In [46]: s1
Out[46]: 
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [47]: df['F'] = s1

Set a value by label:

In [48]: df.at[dates[0],'A'] = 0

Set a value by position:

In [49]: df.iat[0,1] = 0

Set a column by assigning a numpy array:

In [50]: df.loc[:,'D'] = np.array([5] * len(df))

The result of the preceding set operations:

In [51]: df
Out[51]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0

A where operation with setting:

In [52]: df2 = df.copy()

In [53]: df2[df2 > 0] = -df2

In [54]: df2
Out[54]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059 -5  NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0

Missing Data

pandas primarily uses np.nan to represent missing data. By default it is excluded from computations. See the Missing Data section for details.

Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.

In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.c
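
The reindexing and missing-data behaviour described above can be illustrated with a small self-contained sketch. It does not complete the truncated statement; the frame, the column names `A`, `B`, `E` and the variable names are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration; not the tutorial's df.
idx = pd.date_range("2013-01-01", periods=4)
frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=idx,
)

# Keep the first three rows, drop column "B", and add a new column "E".
subset = frame.reindex(index=idx[0:3], columns=["A", "E"])

# Labels that did not exist in the original are filled with NaN.
all_missing = subset["E"].isna().all()

# reindex returned a new object, so writing to it leaves `frame` unchanged.
subset.loc[idx[0], "E"] = 1.0

# Reductions skip NaN by default; skipna=False propagates it instead.
col_mean = subset["E"].mean()
col_mean_strict = subset["E"].mean(skipna=False)

print(subset)
print(col_mean, col_mean_strict)
```

Passing `skipna=False` is one way to make missing values visible in a summary instead of having them silently ignored.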
