Complete edition: 100 pandas exercises for the Python programming language, with answers
1. Import pandas under the name pd.
In [1]:
import pandas as pd
import numpy as np
2. Print the version of pandas that has been imported.
In [2]:
pd.__version__
3. Print out all the version information of the libraries that are required by the pandas library.
In [3]:
pd.show_versions()
4. Create a DataFrame df from this dictionary data which has the index labels.
In [2]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
5. Display a summary of the basic information about this DataFrame and its data.
In [5]:
df.info()
# ...or...
df.describe()
6. Return the first 3 rows of the DataFrame df.
In [6]:
df.iloc[:3]
# or equivalently
df.head(3)
7. Select just the 'animal' and 'age' columns from the DataFrame df.
In [7]:
df.loc[:, ['animal', 'age']]
# or
df[['animal', 'age']]
8. Select the data in rows [3, 4, 8] and in columns ['animal', 'age'].
In [3]:
df.loc[df.index[[3, 4, 8]], ['animal', 'age']]
9. Select only the rows where the number of visits is greater than 3.
In [4]:
df[df['visits'] > 3]
10. Select the rows where the age is missing, i.e. is NaN.
In [5]:
df[df['age'].isnull()]
11. Select the rows where the animal is a cat and the age is less than 3.
In [6]:
df[(df['animal'] == 'cat') & (df['age'] < 3)]
12. Select the rows where the age is between 2 and 4 (inclusive).
In [7]:
df[df['age'].between(2, 4)]
13. Change the age in row 'f' to 1.5.
In [ ]:
df.loc['f', 'age'] = 1.5
14. Calculate the sum of all visits (the total number of visits).
In [ ]:
df['visits'].sum()
15. Calculate the mean age for each different animal in df.
In [8]:
df.groupby('animal')['age'].mean()
16. Append a new row 'k' to df with your choice of values for each column. Then delete that row to return the original DataFrame.
In [ ]:
df.loc['k'] = ['dog', 5.5, 2, 'no']  # values in column order: animal, age, visits, priority
# and then deleting the new row...
df = df.drop('k')
17. Count the number of each type of animal in df.
In [9]:
df['animal'].value_counts()
18. Sort df first by the values in the 'age' column in descending order, then by the values in the 'visits' column in ascending order.
In [10]:
df.sort_values(by=['age', 'visits'], ascending=[False, True])
19. The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.
In [ ]:
df['priority'] = df['priority'].map({'yes': True, 'no': False})
20. In the 'animal' column, change the 'snake' entries to 'python'.
In [14]:
df['animal'] = df['animal'].replace('snake', 'python')
print(df)
21. For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).
In [15]:
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
22. You have a DataFrame df with a column 'A' of integers. For example:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
How do you filter out rows which contain the same integer as the row immediately above?
In [16]:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]
# Alternatively, we could use drop_duplicates() here. Note
# that this removes *all* duplicates though, so it won't
# work as desired if a value reappears later in a non-adjacent row.
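To make the difference between the two approaches concrete, here is a minimal sketch on a hypothetical column in which a value comes back after a gap. The sample data `[1, 1, 2, 2, 1, 1]` and the names `demo`, `by_shift` and `by_dedup` are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical data: the value 1 reappears after a run of 2s.
demo = pd.DataFrame({'A': [1, 1, 2, 2, 1, 1]})

# Keep a row only when it differs from the row immediately above.
by_shift = demo.loc[demo['A'].shift() != demo['A'], 'A'].tolist()

# drop_duplicates() keeps only the first occurrence of each value overall.
by_dedup = demo.drop_duplicates(subset='A')['A'].tolist()

print(by_shift)
print(by_dedup)
```

The shift-based filter keeps the second run of 1s, while drop_duplicates() discards it.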
23. Given a DataFrame of numeric values, say
df = pd.DataFrame(np.random.random(size=(5, 3)))  # a 5x3 frame of float values
how do you subtract the row mean from each element in the row?
In [ ]:
df.sub(df.mean(axis=1), axis=0)
24. Suppose you have a DataFrame with 10 columns of real numbers, for example:
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
Which column of numbers has the smallest sum? (Find that column's label.)
In [17]:
df.sum().idxmin()
25. How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?
In [ ]:
len(df) - df.duplicated(keep=False).sum()
# or perhaps more simply...
len(df.drop_duplicates(keep=False))
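The `keep=False` argument is what makes this count "rows that never repeat" rather than "distinct rows". A small sketch of the distinction, using a made-up frame (the data and the variable names are assumptions, not part of the exercise):

```python
import pandas as pd

# Hypothetical frame: two distinct rows appear twice, two appear once.
demo = pd.DataFrame({'x': [1, 1, 2, 3, 3, 3],
                     'y': ['a', 'a', 'b', 'c', 'c', 'd']})

# Rows that have no duplicate anywhere in the frame.
never_repeated = len(demo.drop_duplicates(keep=False))

# Same count, computed by subtracting every row that has a duplicate.
by_subtraction = len(demo) - demo.duplicated(keep=False).sum()

# Distinct rows, counting each repeated row once.
distinct = len(demo.drop_duplicates())

print(never_repeated, by_subtraction, distinct)
```

Which of the two counts is wanted depends on how "unique" is read; the answer above uses the first reading.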
26. You have a DataFrame that consists of 10 columns of floating-point numbers. Suppose that exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the column which contains the third NaN value.
(You should return a Series of column labels.)
In [ ]:
(df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)
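The expression works because the running count of NaNs along a row first reaches 3 exactly at the third NaN, and idxmax returns the first column where the comparison is True. A minimal sketch on a hand-built two-row frame (the data and the name `demo` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

nan = np.nan

# Hypothetical frame: 10 columns, exactly 5 NaN values in each row.
demo = pd.DataFrame(
    [[nan, 1, nan, 2, nan, 3, nan, 4, nan, 5],
     [1, 2, nan, nan, 3, 4, nan, nan, nan, 5]],
    columns=list('abcdefghij'))

# Running NaN count per row; the first column where it equals 3 is the answer.
third_nan = (demo.isnull().cumsum(axis=1) == 3).idxmax(axis=1)

print(third_nan.tolist())
```

In the first row the NaNs sit in columns a, c, e, g, i, so the third is 'e'; in the second they sit in c, d, g, h, i, so the third is 'g'.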
27. A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For example:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
For each group, find the sum of the three greatest values.
In [ ]:
df.groupby('grps')['vals'].nlargest(3).groupby(level=0).sum()
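As a worked check on the example data, the sketch below takes the three largest values per group and sums them. It regroups on the first index level with `groupby(level=0).sum()`, since the `level` argument of `sum` is no longer available in pandas 2.0 and later; the names `demo` and `top3` are assumptions.

```python
import pandas as pd

demo = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                     'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

# nlargest(3) returns a Series indexed by (group, original row label);
# regrouping on the first index level sums the three values per group.
top3 = demo.groupby('grps')['vals'].nlargest(3).groupby(level=0).sum()

print(top3.to_dict())
```

For group 'a' the three largest values are 345, 52 and 12, which sum to 409.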
28. A DataFrame has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). For each group of 10 consecutive integers in 'A' (i.e. (0, 10], (10, 20], ...), calculate the sum of the corresponding values in column 'B'.
In [ ]:
df.groupby(pd.cut(df['A'], np.arange(0, 101, 10)))['B'].sum()
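A short sketch of how pd.cut assigns values to the right-closed bins (0, 10], (10, 20], and so on. The sample frame is an assumption, and `observed=False` is passed explicitly so that empty bins are kept in the result regardless of the pandas version's default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: values on bin edges show that the bins are right-closed.
demo = pd.DataFrame({'A': [1, 10, 11, 20, 95, 100],
                     'B': [1, 2, 3, 4, 5, 6]})

bins = np.arange(0, 101, 10)  # edges 0, 10, ..., 100

# Group rows by the bin that 'A' falls into, then sum 'B' within each bin.
sums = demo.groupby(pd.cut(demo['A'], bins), observed=False)['B'].sum()

print(sums)
```

Here 10 lands in (0, 10] and 20 in (10, 20], so the first two bins sum to 3 and 7, and the empty bins report 0.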
29. Consider a DataFrame df where there is an integer column 'X':
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
For each value, count the difference back to the previous zero (or the start of the Series, whichever is closer). These values should therefore be [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]. Make this a new column 'Y'.
In [ ]:
izero = np.r_[-1, (df['X'] == 0).to_numpy().nonzero()[0]]  # indices of zeros
idx = np.arange(len(df))
df['Y'] = idx - izero[np.searchsorted(izero - 1, idx) - 1]
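An alternative sketch, not taken from the original answer, that avoids index arithmetic by labelling the stretch that follows each zero and counting positions inside it. The names `demo` and `block` are assumptions.

```python
import pandas as pd

demo = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})

# Each zero starts a new block; rows before the first zero form block 0.
block = demo['X'].eq(0).cumsum()

# Position within each block: 0 at the zero itself, then 1, 2, ...
demo['Y'] = demo.groupby(block).cumcount()

# Block 0 has no leading zero, so it counts from the start of the Series.
demo.loc[block == 0, 'Y'] += 1

print(demo['Y'].tolist())
```

On the example column this reproduces the target values [1, 2, 0, 1, 2, 3, 4, 0, 1, 2].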
30. Consider a DataFrame containing rows and columns of purely numerical data. Create a list of the row-column index locations of the 3 largest values.
In [ ]:
df.unstack().sort_values()[-3:].index.tolist()
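One detail worth seeing on real output: unstack() on a DataFrame produces a Series indexed by (column, row), so the tuples come back in that order. A minimal sketch on a made-up 3x3 frame (the data and names are assumptions; `.iloc[-3:]` is used here as the explicit positional form of the slice):

```python
import pandas as pd

# Hypothetical frame with distinct values so the ordering is unambiguous.
demo = pd.DataFrame({'a': [1, 9, 3],
                     'b': [8, 2, 7],
                     'c': [4, 5, 6]})

# Flatten to a Series indexed by (column, row), sort ascending,
# and keep the index entries of the last three values.
locs = demo.unstack().sort_values().iloc[-3:].index.tolist()

print(locs)
```

The three largest values are 7, 8 and 9, located at ('b', 2), ('b', 0) and ('a', 1), each given as (column, row).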
31. Given a DataFrame with a column of group IDs, 'grps', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.
In [ ]:
def replace(group):
    mask = group < 0
    # keep non-negative values; fill negatives with the mean of the group's non-negative values
    return group.where(~mask, group[~mask].mean())
df.groupby(['grps'])['vals'].transform(replace)
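A small worked sketch of this replacement, assuming "group mean" means the mean of the group's non-negative values (the reading the answer's mask implies). The sample data and the helper name `replace_negatives` are hypothetical.

```python
import pandas as pd

# Hypothetical data: one negative value in each group.
demo = pd.DataFrame({'grps': ['A', 'A', 'A', 'B', 'B'],
                     'vals': [2, -1, 4, -3, 5]})

def replace_negatives(group):
    """Return the group with negatives replaced by the mean of the rest."""
    mask = group < 0
    return group.where(~mask, group[~mask].mean())

result = demo.groupby('grps')['vals'].transform(replace_negatives)

print(result.tolist())
```

In group 'A' the non-negative values 2 and 4 average to 3, so -1 becomes 3; in group 'B' the only non-negative value is 5, so -3 becomes 5.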
32. Implement a rolling mean over groups with window size 3, which ignores NaN values. For example, consider the following DataFrame:
>>> df = pd.DataFrame({'group': list('aabbabbbabab'),
...                    'value': [1, 2, 3, np.nan, 2, 3,
...                              np.nan, 1, 7, 3, np.nan, 8]})
>>> df
   group  value
0      a    1.0
1      a    2.0
2      b    3.0
3      b    NaN
4      a    2.0
5      b    3.0
6      b    NaN
7      b    1.0
8      a    7.0
9      b    3.0
10     a    NaN
11     b    8.0
The goal is to compute the Series:
0     1.000000
1     1.500000
2     3.000000
3     3.000000
4     1.666667
5     3.000000
6     3.000000
7     2.000000
8     3.666667
9     2.000000
10    4.500000
11    4.000000
In [ ]:
g1 = df.groupby(['group'])['value']  # group values
g2 = df.fillna(0).groupby(['group'])['value']  # fillna, then group values
s = g2.rolling(3, min_periods=1).sum() / g1.rolling(3, min_periods=1).count()  # compute means
s.reset_index(level=0, drop=True).sort_index()
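As a cross-check, pandas rolling aggregations skip NaN entries inside each window, so a plain per-group rolling mean with `min_periods=1` should agree with the sum/count ratio on this data. The sketch below compares the two; the names `df32`, `by_ratio` and `by_mean` are assumptions, and this is offered as a verification aid rather than as the intended answer.

```python
import numpy as np
import pandas as pd

df32 = pd.DataFrame({'group': list('aabbabbbabab'),
                     'value': [1, 2, 3, np.nan, 2, 3,
                               np.nan, 1, 7, 3, np.nan, 8]})

grouped = df32.groupby('group')['value']
filled = df32.fillna(0).groupby('group')['value']

# Sum of the window (NaN counted as 0) divided by the number of non-NaN entries.
by_ratio = (filled.rolling(3, min_periods=1).sum()
            / grouped.rolling(3, min_periods=1).count())
by_ratio = by_ratio.reset_index(level=0, drop=True).sort_index()

# Rolling mean directly; NaN entries in the window are skipped.
by_mean = (grouped.rolling(3, min_periods=1).mean()
           .reset_index(level=0, drop=True).sort_index())

print(np.allclose(by_ratio, by_mean))
```

The two would differ only for a window that contains nothing but NaN, which does not occur in this example.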
33. Create a DatetimeIndex that contains each business day of 2015 and use it to index a Series of random numbers. Let's call this Series s.
In [ ]:
dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B')
s = pd.Series(np.random.rand(len(dti)), index=dti)
34. Find the sum of the values in s for every Wednesday.
In [ ]:
s[s.index.weekday == 2].sum()
35. For each calendar month in s, find the mean of values.
In [ ]:
s.resample('M').mean()
36. For each group of four consecutive calendar months in s, find the date on which the highest value occurred.
In [ ]:
s.groupby(pd.Grouper(freq='4M')).idxmax()
37. Create a DatetimeIndex consisting of the third Thursday in each month for the years 2015 and 2016.
In [ ]:
pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')
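The 'WOM-3THU' alias stands for "week of month, third Thursday". A quick sketch that checks the resulting index has the properties the question asks for (the variable name `thursdays` is an assumption):

```python
import pandas as pd

thursdays = pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')

# One date per month over two years.
print(len(thursdays))

# Every date is a Thursday (weekday 3, with Monday as 0).
print((thursdays.weekday == 3).all())

# A third Thursday always falls on day 15 through 21 of its month.
print(((thursdays.day >= 15) & (thursdays.day <= 21)).all())
```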
38. Some values in the FlightNumber column are missing. These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Fill in these missing numbers and make the column an integer column (instead of a float column).
In [ ]:
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
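The DataFrame for this group of questions is not reproduced here, so the sketch below uses an assumed stand-in column whose gaps match the numbers mentioned in the question. The frame `flights` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed sample column: every other flight number is missing.
flights = pd.DataFrame({'FlightNumber': [10045, np.nan, 10065, np.nan, 10085]})

# Linear interpolation fills each gap with the midpoint of its neighbours;
# the column can then be cast from float to int.
flights['FlightNumber'] = flights['FlightNumber'].interpolate().astype(int)

print(flights['FlightNumber'].tolist())
```

Linear interpolation is only appropriate because the numbers are known to increase by a constant step.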
39. The From_To column would be better as two separate columns! Split each string on the underscore delimiter _ to give a new temporary DataFrame with the correct values. Assign the correct column names to this temporary DataFrame.
In [ ]:
temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
40. Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame. Standardise the strings so that only the first letter is uppercase (e.g. "londON" should become "London".)
In [ ]:
temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()
41. Delete the From_To column from df and attach the temporary DataFrame from the previous questions.
In [ ]:
df = df.drop('From_To', axis=1)
df = df.join(temp)
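The three steps from questions 39 to 41 can be run end to end on a small stand-in frame. The rows below, including the placeholder 'Airline' values, are assumptions made up for illustration and are not the exercise's data.

```python
import pandas as pd

# Hypothetical rows with mixed-up capitalisation in From_To.
flights = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN'],
                        'Airline': ['X', 'Y']})

# Step 1: split on the underscore into a temporary two-column frame.
temp = flights['From_To'].str.split('_', expand=True)
temp.columns = ['From', 'To']

# Step 2: first letter uppercase, the rest lowercase.
temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()

# Step 3: drop the combined column and attach the cleaned columns by index.
flights = flights.drop('From_To', axis=1).join(temp)

print(flights)
```

The join works without a key because temp inherits the row index of the frame it was split from.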
42. In the Airline column, you can see some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. '(British Airways.)' should become 'British Airways'.
In [ ]:
df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
# note: using .strip() gets rid of any leading/trailing whitespace
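To see what the pattern captures, here is a sketch on a few made-up airline strings. The sample values are assumptions; the regular expression takes the first run of letters and whitespace, and strip() then trims the ends.

```python
import pandas as pd

# Hypothetical messy values, including one with a leading number.
demo = pd.DataFrame({'Airline': ['(British Airways.)', 'KLM(!)', '12. Air France']})

# Capture the first run of letters and whitespace, then trim the ends.
cleaned = (demo['Airline']
           .str.extract(r'([a-zA-Z\s]+)', expand=False)
           .str.strip())

print(cleaned.tolist())
```

For '12. Air France' the first match begins at the space after the period, which is why the trailing strip() is needed.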
43. In the RecentDelays column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be NaN.
Expand the Series of lists into a DataFrame named delays, rename the columns delay_1, delay_2, etc. and replace the unwanted RecentDelays column in df with delays.
In [ ]:
delays = df['RecentDelays'].apply(pd.Series)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns) + 1)]
df = df.drop('RecentDelays', axis=1).join(delays)
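A minimal sketch of the expansion on an assumed column of lists with different lengths (the frame `flights` and its values are hypothetical). Shorter lists are padded with NaN, which is the behaviour the question asks for.

```python
import pandas as pd

# Hypothetical data: lists of one, two and three delays.
flights = pd.DataFrame({'RecentDelays': [[23, 47], [24, 43, 87], [13]]})

# Each list becomes a row; missing positions are filled with NaN.
delays = flights['RecentDelays'].apply(pd.Series)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns) + 1)]

# Swap the list column for the expanded columns.
flights = flights.drop('RecentDelays', axis=1).join(delays)

print(flights)
```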
44. Given the lists letters = ['A', 'B', 'C'] and num