Programming with dplyr in R

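Programming with dplyr requires some special knowledge because most dplyr verbs are not normal functions. They are quoting functions.

In this vignette you will learn about quoting functions, the challenges they pose for programming, and how tidy evaluation solves those problems.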

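Introduction

Regular functions versus quoting functions

R functions fall into two broad categories: regular functions and quoting functions. They differ in how they receive their arguments.

Regular functions only see values. It does not matter which expression was supplied as an argument or which variables it involved. The value is computed following the standard rules of evaluation.

The fundamental regular function is identity(), which returns the value of its argument. Because only the value matters, all of these statements are completely equivalent:

identity(6)
#> [1] 6

identity(2 * 3)
#> [1] 6

a <- 2
b <- 3
identity(a * b)
#> [1] 6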
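The same point can be made with a user-defined function. The sketch below uses base R only; `describe_value()` is a hypothetical helper introduced for illustration. Because it is a regular function, it receives the computed value and has no way of telling which expression produced it.

```r
# A regular function: the argument is reduced to a value, and the body
# only ever works with that value.
# `describe_value` is a hypothetical helper, not part of any package.
describe_value <- function(x) {
  paste("got the value", x)
}

a <- 2
b <- 3

r1 <- describe_value(6)
r2 <- describe_value(2 * 3)
r3 <- describe_value(a * b)

# The three calls cannot be told apart from inside the function.
print(r1)
print(identical(r1, r2) && identical(r2, r3))
```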

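On the other hand, a quoting function sees the expression supplied as an argument rather than the value of that expression. The expression might be evaluated later, or not at all. The fundamental quoting function is quote(), which returns its argument as an expression:

quote(6)
#> [1] 6

quote(2 * 3)
#> 2 * 3

quote(a * b)
#> a * b

In fact, quoting is something all programmers are familiar with, because it is how they create strings. The " character is a quoting operator: it signals that the supplied characters are text, not code. As an R programmer you are probably also familiar with the formula operator ~. This quoting operator returns one or two quoted expressions. The following three expressions do a similar thing; they all quote their input:

"a * b"
#> [1] "a * b"

~a * b
#> ~a * b

quote(a * b)
#> a * b

The first statement returns a quoted string; the other two return quoted code, either in a formula or as a bare expression.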
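To see how a function can receive an expression instead of a value, here is a minimal base R sketch. It relies on substitute(), which returns the unevaluated argument, and deparse(), which turns an expression into text. `show_expr()` is a hypothetical name used only for this example.

```r
# A quoting function captures the expression typed by the caller.
# `show_expr` is a hypothetical helper used only for illustration.
show_expr <- function(x) {
  deparse(substitute(x))
}

a <- 2
b <- 3

print(show_expr(6))        # the constant 6
print(show_expr(2 * 3))    # the call 2 * 3, not the value 6
print(show_expr(a * b))    # the call a * b; a and b are never looked up

# The argument is never evaluated, so an undefined name causes no error.
print(show_expr(undefined_name + 1))
```

Since the argument is only captured and never forced, the function works even when the expression could not be evaluated at all.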

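Changing the context of evaluation

A quoted expression can be evaluated with the function eval(). Let's quote an expression that subsets the letters vector from 1 to 5 and evaluate it:

x <- quote(letters[1:5])
x
#> letters[1:5]

eval(x)
#> [1] "a" "b" "c" "d" "e"

Of course, this is not very impressive: you could type the expression normally to get that value. But one of the most important features of R is the ability to change the context of evaluation to get a different result. A context, also called an environment, is basically a set of bindings that link symbols to values. The namespace of a package is such a context. For example, in the context of the base namespace, the symbol letters is bound to the character vector of lowercase letters.

In another context, however, it could mean something different. We can create a context in which letters stands for the uppercase letters in reverse order! Evaluating the quoted expression in this context returns an altogether different result:

context <- list(letters = rev(LETTERS))
x
#> letters[1:5]

eval(x, context)
#> [1] "Z" "Y" "X" "W" "V"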
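The idea carries over to explicit environments as well as lists. The following base R sketch evaluates one quoted expression in three contexts; the variable names and the replacement values are made up for the example.

```r
# The same quoted expression evaluated in three different contexts.
x <- quote(letters[1:3])

# 1. Default context: `letters` is found in the base namespace.
default_result <- eval(x)

# 2. A list used as context: `letters` is looked up in the list first.
list_result <- eval(x, list(letters = rev(LETTERS)))

# 3. An explicit environment whose parent is the base environment, so
#    `[` and `:` are still found while `letters` is overridden.
env <- new.env(parent = baseenv())
assign("letters", c("x", "y", "z"), envir = env)
env_result <- eval(x, env)

print(default_result)
print(list_result)
print(env_result)
```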

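Interestingly, data frames can be used as evaluation contexts. In a data frame context, column names stand for vectors, so you can refer to those columns in your expression:

library(tibble)  # provides tibble()

data1 <- tibble(mass = c(70, 80, 90), height = c(1.6, 1.7, 1.8))
data2 <- tibble(mass = c(75, 85, 95), height = c(1.5, 1.7, 1.9))

bmi_expr <- quote(mass / height^2)

eval(bmi_expr, data1)
#> [1] 27.3 27.7 27.8

eval(bmi_expr, data2)
#> [1] 33.3 29.4 26.3

In the last snippet we created an expression with quote() and then evaluated it manually with eval(). Quoting functions, however, typically perform the quoting and the evaluation behind the scenes:

with(data1, mass / height^2)
#> [1] 27.3 27.7 27.8

with(data2, mass / height^2)
#> [1] 33.3 29.4 26.3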
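What with() does behind the scenes can be approximated in a few lines of base R. This is a sketch rather than the actual implementation of with(): `eval_in()` is a hypothetical name, and the data values and `unit_scale` are made up.

```r
# A tiny quoting function in the spirit of with().
# `eval_in` is a hypothetical name used only for illustration.
eval_in <- function(data, expr) {
  quoted <- substitute(expr)          # capture the expression, unevaluated
  eval(quoted, data, parent.frame())  # columns first, then the caller's scope
}

people <- data.frame(mass = c(70, 80, 90), height = c(1.6, 1.7, 1.8))
unit_scale <- 1  # not a column: found in the calling environment

bmi <- eval_in(people, unit_scale * mass / height^2)
print(round(bmi, 1))
```

Passing parent.frame() as the enclosure lets the expression mix column names with ordinary variables from the caller, which is how quoting functions are usually used interactively.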

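For this reason, quoting functions usually take a data frame as input in addition to the user's expressions, so that the expressions can be evaluated in the context of the data. This is a powerful feature that makes R a data-oriented programming language.

Quoting functions are everywhere in R:

- with(data, expr) evaluates expr in the context of data.
- lm(formula, data) creates a design matrix in which the predictors are evaluated in the context of data.
- mutate(data, new = expr) creates a new column from an expression evaluated in the context of data.
- ggplot(data, aes(expr)) defines the x aesthetic as the value of expr evaluated in the context of data.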
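The first two entries can be tried with base R and the stats package alone. The data below is made up so that y is exactly 2 * x + 1, which makes the fitted coefficients easy to check.

```r
# Made-up data in which y is exactly 2 * x + 1.
df <- data.frame(x = c(1, 2, 3, 4), y = c(3, 5, 7, 9))

# with(): the expression is evaluated with the columns of df in scope.
total <- with(df, sum(x * y))

# lm(): the formula is quoted, and x and y are looked up in df.
fit <- lm(y ~ x, data = df)

print(total)
print(coef(fit))
```

Neither call mentions df$x or df$y: both functions quote their input and resolve the column names against the data frame.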

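In the context of dplyr's interface, quoting arguments has two benefits:

- Operations on data frames can be expressed succinctly because you do not need to repeat the name of the data frame. For example, you can write filter(df, x == 1, y == 2, z == 3) instead of df[df$x == 1 & df$y == 2 & df$z == 3, ].
- dplyr can choose to compute results differently from base R. This matters for database backends, where dplyr does not do any of the work itself but generates the SQL that tells the database what to do.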
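The first benefit can be checked on a small data frame. The sketch below assumes the dplyr package is installed; the data and the `id` column are made up, and `id` exists only to compare the two results.

```r
library(dplyr)

# Made-up data; the `id` column only exists to compare the results.
df <- data.frame(
  id = 1:6,
  x  = c(1, 1, 2, 1, 1, 2),
  y  = c(2, 2, 2, 3, 2, 2),
  z  = c(3, 4, 3, 3, 3, 3)
)

# dplyr: column names are quoted and looked up in df.
by_dplyr <- filter(df, x == 1, y == 2, z == 3)

# Base R: the data frame name must be repeated for every column.
by_base <- df[df$x == 1 & df$y == 2 & df$z == 3, ]

print(by_dplyr$id)
print(by_base$id)
```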

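Unfortunately, the benefits of quoting functions do not come for free. They simplify direct input, but they make programmatic input harder.

Quoting works for you when you use dplyr interactively, but it works against you when you call dplyr from inside your own functions.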

Varying quoted inputs

The issue of referential transparency has to do with the difficulty of passing contextual variables in order to vary the inputs of quoting functions. When you pass variables to quoting functions, they get quoted along with the rest of the expression.

To see the problem more clearly, let's define a simple quoting function that pastes its inputs as a string:

library(rlang)  # exprs(), as_string()
library(purrr)  # map()

cement <- function(..., .sep = " ") {
  strings <- map(exprs(...), as_string)
  paste(strings, collapse = .sep)
}

Compared to the regular function paste(), the quoting function cement() saves a bit of typing because it performs the string-quoting automatically:

paste("it", "is", "rainy")
#> [1] "it is rainy"

cement(it, is, rainy)
#> [1] "it is rainy"

Now what if we wanted to store the weather adjective in a variable? paste() has no issue on that front because it gets the value of the argument rather than its expression. On the other hand, if we pass a variable to cement(), it is quoted just like the other inputs and cement() never gets to see its contents:

x <- "shiny"

paste("it", "is", x)
#> [1] "it is shiny"

cement(it, is, x)
#> [1] "it is x"

The solution to this problem is a special syntax that signals to the quoting function that part of the argument is to be unquoted, i.e., evaluated right away. The ability to mix quoting and evaluation is called quasiquotation and is the main tidy eval feature.

Quasiquotation

Put simply, quasi-quotation enables one to introduce symbols that stand for a linguistic expression in a given instance and are used as that linguistic expression in a different instance. (Willard van Orman Quine)

As we have seen, automatic quoting makes R and dplyr very convenient for interactive use but makes it difficult to refer to variable inputs. The solution to this problem is quasiquotation, which allows you to evaluate directly inside an expression that is otherwise quoted. Quasiquotation was coined by Willard van Orman Quine in the 1940s, and was adopted for programming by the LISP community in the 1970s. Quasiquotation is available (or will soon be) in all quoting functions of the tidyverse thanks to the tidy evaluation framework.

The bang! bang! operator

The tidy eval syntax for unquoting is !!. Anything supplied to this operator is evaluated right away and the result is substituted in place. Let's see !! in action in our cement() function:

x <- "shiny"

cement(it, is, !!x)
#> [1] "it is shiny"
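For readers who want to run this directly, here is a self-contained variant. It is a sketch under one assumption: vapply() from base R replaces map(), so that rlang is the only package required. Both exprs() and as_string() come from rlang.

```r
library(rlang)

# Same idea as cement(), written with vapply() so that rlang is the only
# package needed (an assumption made to keep the sketch self-contained).
cement <- function(..., .sep = " ") {
  # exprs() quotes the arguments and processes any !! it finds.
  strings <- vapply(exprs(...), as_string, character(1))
  paste(strings, collapse = .sep)
}

x <- "shiny"

quoted   <- cement(it, is, x)     # x is quoted like the other inputs
unquoted <- cement(it, is, !!x)   # x is evaluated and its value substituted

print(quoted)
print(unquoted)
```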

Even though the arguments are quoted, !!x signals that x should be evaluated right away. From cement()'s perspective, it's as if the user had typed "shiny" instead of !!x.

We have seen above that the fundamental quoting function in base R is quote(). In the tidyverse, it is expr(). All it does is quote its argument with quasiquotation support and return it right away:

expr(x)
#> x

expr(!!x)
#> [1] "shiny"

expr() is especially useful for debugging quasiquotation. You can wrap it around any expression in which you use !! to examine the effect of unquoting. Let's try it with cement():

expr(cement(it, is, !!x))
#> cement(it, is, "shiny")

This technique is essential for working your way towards mastering tidy eval.
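Base R offers a comparable facility in bquote(), where the part to evaluate is wrapped in .() instead of being prefixed with !!. The sketch below puts the two side by side; it assumes rlang is installed. Nothing is evaluated apart from x, so cement() does not need to exist for this to run.

```r
library(rlang)

x <- "shiny"

# rlang: !! marks the part of the expression to evaluate immediately.
tidy_call <- expr(cement(it, is, !!x))

# Base R analogue: bquote() quotes its argument and evaluates whatever
# is wrapped in .().
base_call <- bquote(cement(it, is, .(x)))

print(tidy_call)
print(base_call)
```

bquote() is shown here only as a point of comparison; the tidyverse quoting functions understand !!, not .().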

Creating symbols

Now that we are armed with quasiquotation, let's try to program with the dplyr verb mutate(). We'll take a BMI computation as a running example.

library(dplyr)  # mutate(), transmute() and the starwars data

# Rescale height
starwars <- mutate(starwars, height = height / 100)

transmute(starwars, bmi = mass / height^2)
#> # A tibble: 87 x 1
#>     bmi
#> 1  26.0
#> 2  26.9
#> 3  34.7
#> 4  33.3
#> # ... with 83 more rows

Let's say we want to vary the height input. A first intuition might be to store the column name in a variable and unquote it. But we get an error:

x <- "height"

transmute(starwars, bmi = mass / (!!x)^2)
#> Error in mutate_impl(.data, dots): Evaluation error: non-numeric argument to binary operator.

The error message indicates a type error. A binary operator expected a numeric input but got something else. The error becomes clear if we use expr() to debug the unquoting:

expr(transmute(starwars, bmi = mass / (!!x)^2))
#> transmute(starwars, bmi = mass / ("height")^2)

We are unquoting a string and that's exactly what transmute() uses to evaluate the BMI. This can't work!

We need to unquote something that looks like code instead of a string. What we are looking for is a symbol. A symbol is a string that references an object in a context. Symbols are the meat of R code. In foo(bar), foo is a symbol that references a function and bar is a symbol that references some object.

There are two ways of creating symbolic R code objects: by quotation or by construction. We already know how to create symbols by quoting. However, that does not help us much because we face the same issue again, namely that the quoted symbol is a constant that can't be varied:

quote(height)
#> height

expr(height)
#> height

The other way is to build it out of a string using the constructor sym(). Constructors are regular functions and can be programmed with variables:

sym("height")
#> height

x <- "height"
sym(x)
#> height
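The following sketch checks that a constructed symbol is the same object as a quoted one, and then unquotes two constructed symbols into a larger expression. It assumes rlang is installed; as.name() is the base R counterpart of sym(), and the data values are made up.

```r
library(rlang)

# sym() is a regular function: its argument can be any string value.
col <- "height"
height_sym <- sym(col)

# It builds the same object as quoting the name directly,
# or as base R's as.name().
print(identical(height_sym, quote(height)))
print(identical(height_sym, as.name(col)))

# Symbols built this way can be unquoted into a larger expression.
mass_sym <- sym("mass")
bmi_expr <- expr((!!mass_sym) / (!!height_sym)^2)
print(bmi_expr)

# Evaluate the constructed expression with made-up data as context.
people <- data.frame(mass = c(70, 80), height = c(1.6, 1.7))
bmi <- eval(bmi_expr, people)
print(round(bmi, 1))
```

The parentheses around each !! keep the unquoting visually separate from the arithmetic operators around it.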

Let's build a symbol and try to unquote it in the transmute expression. Using expr() to examine the effect of unquoting, things are looking good:

x <- sym("height")

expr(transmute(starwars, bmi = mass / (!!x)^2))
#> transmute(starwars, bmi = mass / (height)^2)

And indeed it now works!

transmute(starwars, bmi = mass / (!!x)^2)
#> # A tibble: 87 x 1
#>     bmi
#> 1  26.0
#> 2  26.9
#> 3  34.7
#> 4  33.3
#> # ... with 83 more rows
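The same pattern can be packaged so that the column is chosen by name at call time. This is a minimal sketch assuming dplyr and rlang are installed; `bmi_from()`, the `height_shoes` column, and all data values are hypothetical.

```r
library(dplyr)
library(rlang)

# Made-up data with two candidate height columns (in metres).
people <- tibble(
  mass         = c(70, 80, 90),
  height       = c(1.6, 1.7, 1.8),
  height_shoes = c(1.63, 1.73, 1.83)
)

# `bmi_from` is a hypothetical helper: the height column is chosen by name.
bmi_from <- function(data, height_col) {
  h <- sym(height_col)                    # string -> symbol
  transmute(data, bmi = mass / (!!h)^2)   # symbol unquoted into the call
}

barefoot <- bmi_from(people, "height")
shod     <- bmi_from(people, "height_shoes")

print(barefoot)
print(shod)
```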

Creating a wrapper around a dplyr pipeline

Quasiquotation is all we need to write our first wrapper function around a dplyr pipeline. The goal is to write reliable functions that reduce duplication in our data analysis code. Let's say that we often take a grouped average using dplyr and our scripts are littered with little pipelines that look like this:

starwars %>%
  group_by(species) %>%
  summarise(avg = mean(height))
#> # A tibble: 38 x 2
#>   species    avg
#> 1 Aleena    0.79
#> 2 Besalisk  1.98
#> 3 Cerean    1.98
#> 4 Chagrian  1.96
#> # ... with 34 more rows

It would be a good idea to extract this logic into a function. It would reduce the risk of writing a typo and would make our code more concise, as well as clearer if we choose a good name for this function.

We know from the previous sections that this kind of naive wrapper will not work. The variable names will be automatically quoted. The column names they contain will be at best ignored (in the example below group_by() looks for a column named group) or at worst misused (summarise() would try to take the average of the string "height"):

mean_by <- function(data, var, group) {
  data %>%
    group_by(group) %>%
    summarise(avg = mean(var))
}

mean_by(starwars, "height", "species")
#> Error in grouped_df_impl(data, unname(vars), drop): Column `group` is unknown

Our wrapper simply needs to construct symbols from its inputs and unquote them in the pipeline:

mean_by <- function(data, var, group) {
  var <- sym(var)
  group <- sym(group)

  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, "height", "species")
#> # A tibble: 38 x 2
#>   species    avg
#> 1 Aleena    0.79
#> 2 Besalisk  1.98
#> 3 Cerean    1.98
#> 4 Chagrian  1.96
#> # ... with 34 more rows

mean_by(starwars, "mass", "eye_color")
#> # A tibble: 15 x 2
#>   eye_color   avg
#> 1 black        NA
#> 2 blue         NA
#> 3 blue-gray    77
#> 4 brown        NA
#> # ... with 11 more rows
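The NA averages in the eye_color output arise because mean() returns NA as soon as a group contains a missing mass. The sketch below reproduces the wrapper on a small made-up data set and adds an `na.rm` argument. That argument is an addition for this illustration and is not part of the original function. dplyr and rlang are assumed to be installed.

```r
library(dplyr)
library(rlang)

# Same wrapper as above, plus an `na.rm` argument. That argument is an
# addition made for this sketch; it is not part of the original function.
mean_by <- function(data, var, group, na.rm = FALSE) {
  var <- sym(var)
  group <- sym(group)

  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var, na.rm = na.rm))
}

# Made-up data with one missing value.
toy <- tibble(
  colour = c("blue", "blue", "brown", "brown"),
  mass   = c(70, NA, 80, 90)
)

with_na    <- mean_by(toy, "mass", "colour")
without_na <- mean_by(toy, "mass", "colour", na.rm = TRUE)

print(with_na)
print(without_na)
```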

Creating your own quoting functions

The wrapper that we just created is a regular function. It takes strings and doesn't quote any of its inputs. This has the advantage t
