Complete Edition: 100 pandas Exercises in Python, with Answers


1. Import pandas under the name pd.

In [1]:

import pandas as pd
import numpy as np

2. Print the version of pandas that has been imported.

In [2]:

pd.__version__

3. Print out all the version information of the libraries that are required by the pandas library.

In [3]:

pd.show_versions()

4. Create a DataFrame df from this dictionary data which has the index labels.

In [2]:

# The last 'animal' entry is cut off in the source; 'dog' is assumed so that all columns have 10 values.
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(data, index=labels)

5. Display a summary of the basic information about this DataFrame and its data.

In [5]:

df.info()
# ...or...
df.describe()

6. Return the first 3 rows of the DataFrame df.

In [6]:

df.iloc[:3]
# or equivalently
df.head(3)

7. Select just the 'animal' and 'age' columns from the DataFrame df.

In [7]:

df.loc[:, ['animal', 'age']]
# or
df[['animal', 'age']]

8. Select the data in rows [3, 4, 8] and in columns ['animal', 'age'].

In [3]:

df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

9. Select only the rows where the number of visits is greater than 3.

In [4]:

df[df['visits'] > 3]

10. Select the rows where the age is missing, i.e. is NaN.

In [5]:

df[df['age'].isnull()]

11. Select the rows where the animal is a cat and the age is less than 3.

In [6]:

df[(df['animal'] == 'cat') & (df['age'] < 3)]

12. Select the rows where the age is between 2 and 4 (inclusive).

In [7]:

df[df['age'].between(2, 4)]

13. Change the age in row 'f' to 1.5.

In [ ]:

df.loc['f', 'age'] = 1.5

14. Calculate the sum of all visits (the total number of visits).

In [ ]:

df['visits'].sum()

15. Calculate the mean age for each different animal in df.

In [8]:

df.groupby('animal')['age'].mean()

16. Append a new row 'k' to df with your choice of values for each column. Then delete that row to return the original DataFrame.

In [ ]:

# values must follow the column order of df: animal, age, visits, priority
df.loc['k'] = ['dog', 5.5, 2, 'no']
# and then deleting the new row...
df = df.drop('k')

17. Count the number of each type of animal in df.

In [9]:

df['animal'].value_counts()

18. Sort df first by the values in the 'age' column in descending order, then by the values in the 'visits' column in ascending order.

In [10]:

df.sort_values(by=['age', 'visits'], ascending=[False, True])

19. The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.

In [ ]:

df['priority'] = df['priority'].map({'yes': True, 'no': False})

20. In the 'animal' column, change the 'snake' entries to 'python'.

In [14]:

df['animal'] = df['animal'].replace('snake', 'python')
print(df)

21. For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

In [15]:

df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

22. You have a DataFrame df with a column 'A' of integers. For example:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

How do you filter out rows which contain the same integer as the row immediately above?

In [16]:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]
# Alternatively, we could use drop_duplicates() here. Note
# that this removes *all* duplicates though, so it won't
# give the same result when a value reappears later in the column.
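The difference between the two approaches only shows up when a value recurs in non-adjacent rows. The sketch below uses a small hypothetical column (the name demo and its values are not from the exercise) to compare them.

```python
import pandas as pd

# Hypothetical column in which values recur in non-adjacent rows.
demo = pd.DataFrame({'A': [1, 1, 2, 2, 1, 1]})

# Keep a row only when it differs from the row immediately above it.
consecutive = demo.loc[demo['A'].shift() != demo['A'], 'A'].tolist()

# drop_duplicates() keeps only the first occurrence of each value overall.
overall = demo.drop_duplicates(subset='A')['A'].tolist()

print(consecutive)  # [1, 2, 1]
print(overall)      # [1, 2]
```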

23. Given a DataFrame of numeric values, say

df = pd.DataFrame(np.random.random(size=(5, 3)))  # a 5x3 frame of float values

how do you subtract the row mean from each element in the row?

In [ ]:

df.sub(df.mean(axis=1), axis=0)
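Because the exercise uses random data, the effect is easier to see on fixed numbers. A minimal sketch with a hypothetical 2x3 frame: axis=1 computes one mean per row, and axis=0 tells sub() to align that Series with the row index.

```python
import pandas as pd

# Hypothetical fixed values so the result can be checked by hand.
demo = pd.DataFrame([[1.0, 2.0, 3.0],
                     [10.0, 20.0, 60.0]])

row_means = demo.mean(axis=1)          # one mean per row: 2.0 and 30.0
centered = demo.sub(row_means, axis=0)  # subtract each row's mean from that row

print(centered.values.tolist())  # [[-1.0, 0.0, 1.0], [-20.0, -10.0, 30.0]]
```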

24. Suppose you have a DataFrame with 10 columns of real numbers, for example:

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))

Which column of numbers has the smallest sum? (Find that column's label.)

In [17]:

df.sum().idxmin()

25. How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

In [ ]:

len(df) - df.duplicated(keep=False).sum()
# or perhaps more simply...
len(df.drop_duplicates(keep=False))
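Note that keep=False discards every copy of a duplicated row, so this counts rows that occur exactly once, which is not the same as the number of distinct rows. A small sketch on hypothetical data showing both counts:

```python
import pandas as pd

# Hypothetical frame: (1, 'a') appears twice, (3, 'c') appears twice.
demo = pd.DataFrame({'x': [1, 1, 2, 3, 3, 3],
                     'y': ['a', 'a', 'b', 'c', 'c', 'd']})

# Rows that occur exactly once (all copies of duplicated rows are dropped).
once_only = len(demo.drop_duplicates(keep=False))

# Distinct rows (one copy of each duplicated row is kept).
distinct = len(demo.drop_duplicates())

print(once_only, distinct)  # 2 4
```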

26. You have a DataFrame that consists of 10 columns of floating-point numbers. Suppose that exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the column which contains the third NaN value.

(You should return a Series of column labels.)

In [ ]:

(df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)
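The expression works because the running count of NaNs along a row first reaches 3 exactly at the third NaN, and idxmax returns the first column where the comparison is True. A sketch on a smaller hypothetical frame (6 columns rather than 10, and not exactly 5 NaNs per row):

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical data: row 0 has NaNs in b, c, e; row 1 has NaNs in a, c, d, f.
demo = pd.DataFrame([[1.0, nan, nan, 4.0, nan, 6.0],
                     [nan, 2.0, nan, nan, 5.0, nan]],
                    columns=list('abcdef'))

# Running count of NaNs per row, then the first column where the count hits 3.
running = demo.isnull().cumsum(axis=1)
third_nan = (running == 3).idxmax(axis=1)

print(third_nan.tolist())  # ['e', 'd']
```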

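27. A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For example:

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

For each group, find the sum of the three greatest values.

In [ ]:

# sum(level=0) is no longer available in pandas 2.x; group on the first index level instead
df.groupby('grps')['vals'].nlargest(3).groupby(level=0).sum()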
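As a check on the example data, the sketch below sums the three largest values per group. It groups on the column name 'grps' as defined in the example, and uses groupby(level=0).sum() because Series.sum(level=...) was removed in pandas 2.0.

```python
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

# nlargest(3) returns a Series indexed by (group, original row);
# grouping on level 0 then adds up the three values kept for each group.
top3_sum = df.groupby('grps')['vals'].nlargest(3).groupby(level=0).sum()

print(top3_sum.to_dict())  # {'a': 409, 'b': 156, 'c': 345}
```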

28. A DataFrame has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). For each group of 10 consecutive integers in 'A' (i.e. (0, 10], (10, 20], ...), calculate the sum of the corresponding values in column 'B'.

In [ ]:

df.groupby(pd.cut(df['A'], np.arange(0, 101, 10)))['B'].sum()
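A short sketch of how pd.cut assigns values to the right-closed bins, using a hypothetical six-row frame. Passing observed=False explicitly is an addition here: it keeps the empty bins in the result and avoids depending on the default, which differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: boundary values 10, 20 and 100 fall in the bin they close.
demo = pd.DataFrame({'A': [1, 10, 11, 20, 95, 100],
                     'B': [1, 2, 3, 4, 5, 6]})

bins = np.arange(0, 101, 10)  # edges 0, 10, ..., 100 -> 10 intervals
sums = demo.groupby(pd.cut(demo['A'], bins), observed=False)['B'].sum()

print(sums.tolist())  # [3, 7, 0, 0, 0, 0, 0, 0, 0, 11]
```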

29. Consider a DataFrame df where there is an integer column 'X':

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})

For each value, count the difference back to the previous zero (or the start of the Series, whichever is closer). These values should therefore be [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]. Make this a new column 'Y'.

In [ ]:

# Series.nonzero() was removed from pandas; np.flatnonzero gives the same positions
izero = np.r_[-1, np.flatnonzero(df['X'] == 0)]  # indices of zeros
idx = np.arange(len(df))
df['Y'] = idx - izero[np.searchsorted(izero - 1, idx) - 1]
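An alternative that stays within pandas, offered as a sketch rather than the source's method: label each stretch by the number of zeros seen so far, then number the rows within each stretch. Rows before the first zero are counted from the start of the Series, so they need an offset of 1.

```python
import pandas as pd

demo = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})

# Stretch id: increases by one at every zero, so each zero starts a new group.
groups = demo['X'].eq(0).cumsum()

# Position within each stretch; rows before the first zero (group 0) get +1.
demo['Y'] = demo.groupby(groups).cumcount() + (groups == 0)

print(demo['Y'].tolist())  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```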

30. Consider a DataFrame containing rows and columns of purely numerical data. Create a list of the row-column index locations of the 3 largest values.

In [ ]:

df.unstack().sort_values()[-3:].index.tolist()
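One detail worth seeing on concrete numbers: unstack() puts the column label on the outer index level, so each tuple comes back as (column, row), ordered from the third largest to the largest. A sketch on a hypothetical 3x2 frame:

```python
import pandas as pd

# Hypothetical data; the three largest values are 9, 8 and 7.
demo = pd.DataFrame({'a': [1, 9, 3],
                     'b': [8, 2, 7]})

# unstack() yields a Series indexed by (column label, row label).
top3 = demo.unstack().sort_values()[-3:].index.tolist()

print(top3)  # [('b', 2), ('b', 0), ('a', 1)]
```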

31. Given a DataFrame with a column of group IDs, 'grps', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.

In [ ]:

def replace(group):
    mask = group < 0
    # keep non-negative values; fill the rest with the mean of the non-negative values in the group
    return group.where(~mask, group[~mask].mean())

df.groupby(['grps'])['vals'].transform(replace)

32. Implement a rolling mean over groups with window size 3, which ignores NaN values. For example, consider the following DataFrame:

>>> df = pd.DataFrame({'group': list('aabbabbbabab'),
...                    'value': [1, 2, 3, np.nan, 2, 3,
...                              np.nan, 1, 7, 3, np.nan, 8]})
>>> df
   group  value
0      a    1.0
1      a    2.0
2      b    3.0
3      b    NaN
4      a    2.0
5      b    3.0
6      b    NaN
7      b    1.0
8      a    7.0
9      b    3.0
10     a    NaN
11     b    8.0

The goal is to compute the Series:

0     1.000000
1     1.500000
2     3.000000
3     3.000000
4     1.666667
5     3.000000
6     3.000000
7     2.000000
8     3.666667
9     2.000000
10    4.500000
11    4.000000

In [ ]:

g1 = df.groupby(['group'])['value']             # group values
g2 = df.fillna(0).groupby(['group'])['value']   # fillna, then group values

# windowed sum (NaN counted as 0) divided by the number of non-NaN values in the window
s = g2.rolling(3, min_periods=1).sum() / g1.rolling(3, min_periods=1).count()  # compute means

s.reset_index(level=0, drop=True).sort_index()

33. Create a DatetimeIndex that contains each business day of 2015 and use it to index a Series of random numbers. Let's call this Series s.

In [ ]:

dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B')
s = pd.Series(np.random.rand(len(dti)), index=dti)

34. Find the sum of the values in s for every Wednesday.

In [ ]:

s[s.index.weekday == 2].sum()

35. For each calendar month in s, find the mean of values.

In [ ]:

s.resample('M').mean()  # pandas 2.2 and later prefer the alias 'ME' for month end

36. For each group of four consecutive calendar months in s, find the date on which the highest value occurred.

In [ ]:

# pd.TimeGrouper has been removed from pandas; pd.Grouper(freq=...) replaces it
s.groupby(pd.Grouper(freq='4M')).idxmax()

37. Create a DatetimeIndex consisting of the third Thursday in each month for the years 2015 and 2016.

In [ ]:

pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')
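The 'WOM-3THU' alias means "week of month, third Thursday". A quick sanity check, written as a sketch, confirms that the result has one date per month, that every date is a Thursday, and that each falls on day 15 to 21 of its month.

```python
import pandas as pd

third_thu = pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')

n_dates = len(third_thu)                          # 24 months -> 24 dates
all_thursday = bool((third_thu.weekday == 3).all())  # Monday=0, so Thursday=3
in_third_week = bool(((third_thu.day >= 15) & (third_thu.day <= 21)).all())

print(n_dates, all_thursday, in_third_week)  # 24 True True
print(third_thu[0].date())                   # 2015-01-15
```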

38. Some values in the FlightNumber column are missing. These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Fill in these missing numbers and make the column an integer column (instead of a float column).

In [ ]:

df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
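The DataFrame for this exercise is not shown here, so the sketch below uses a hypothetical Series (the name flights and the three known values are assumptions chosen to match the two missing numbers named in the question). Linear interpolation fills each gap with the midpoint of its neighbours, which is exactly the "increase by 10" pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical flight numbers with two gaps (assumed values, not the exercise's data).
flights = pd.Series([10045, np.nan, 10065, np.nan, 10085])

# interpolate() defaults to linear interpolation; the NaNs forced a float dtype,
# so astype(int) converts back once nothing is missing.
filled = flights.interpolate().astype(int)

print(filled.tolist())  # [10045, 10055, 10065, 10075, 10085]
```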

39. The From_To column would be better as two separate columns! Split each string on the underscore delimiter _ to give a new temporary DataFrame with the correct values. Assign the correct column names to this temporary DataFrame.

In [ ]:

temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']

40. Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame. Standardise the strings so that only the first letter is uppercase (e.g. "londON" should become "London").

In [ ]:

temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()
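The split and capitalise steps can be tried on a few hypothetical route strings (the frame routes and its values are made up for illustration, since the exercise's DataFrame is not shown here):

```python
import pandas as pd

# Hypothetical From_To values with mixed capitalisation.
routes = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm']})

# expand=True returns one column per piece instead of a Series of lists.
temp = routes['From_To'].str.split('_', expand=True)
temp.columns = ['From', 'To']

# capitalize() uppercases the first character and lowercases the rest.
temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()

print(temp['From'].tolist())  # ['London', 'Madrid', 'London']
print(temp['To'].tolist())    # ['Paris', 'Milan', 'Stockholm']
```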

41. Delete the From_To column from df and attach the temporary DataFrame from the previous questions.

In [ ]:

df = df.drop('From_To', axis=1)
df = df.join(temp)

42. In the Airline column, you can see some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. '(British Airways. )' should become 'British Airways'.

In [ ]:

df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
# note: using .strip() gets rid of any leading/trailing whitespace
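To see what the pattern captures, here is a sketch on three hypothetical airline strings (assumed values, not the exercise's data). The regex grabs the first run of letters and whitespace; the third string shows why .strip() is needed, since the run starts with a space.

```python
import pandas as pd

# Hypothetical messy airline names.
airlines = pd.Series(['(British Airways. )', '<Air France> (12)', '12. Swiss Air'])

# A raw string avoids Python treating \s as a string escape.
cleaned = airlines.str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()

print(cleaned.tolist())  # ['British Airways', 'Air France', 'Swiss Air']
```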

43. In the RecentDelays column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be NaN.

Expand the Series of lists into a DataFrame named delays, rename the columns delay_1, delay_2, etc. and replace the unwanted RecentDelays column in df with delays.

In [ ]:

delays = df['RecentDelays'].apply(pd.Series)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns) + 1)]
df = df.drop('RecentDelays', axis=1).join(delays)
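A sketch of the expansion on a hypothetical frame with lists of different lengths (the frame demo and its values are assumptions for illustration). Shorter lists are padded with NaN, which is why the delay columns come out as floats.

```python
import pandas as pd

# Hypothetical data: lists of length 2, 1 and 3.
demo = pd.DataFrame({'Airline': ['KLM', 'Air France', 'Swiss Air'],
                     'RecentDelays': [[23, 47], [13], [24, 43, 87]]})

# Each list becomes a row; positions missing from shorter lists become NaN.
delays = demo['RecentDelays'].apply(pd.Series)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns) + 1)]

out = demo.drop('RecentDelays', axis=1).join(delays)

print(list(out.columns))  # ['Airline', 'delay_1', 'delay_2', 'delay_3']
```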

44. Given the lists letters = ['A', 'B', 'C'] and num
