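Programming with dplyr in R

Programming with dplyr requires some special knowledge because most dplyr verbs are not normal functions. They are quoting functions. In this vignette you will learn about quoting functions, the challenges they pose for programming, and how tidy evaluation solves those problems.

Introduction

Regular functions versus quoting functions

R functions fall into two broad categories: regular functions and quoting functions. They differ in the way they get their arguments. Regular functions only see values. It does not matter what expression was supplied as argument or which variables it involves. The value is computed following the standard rules of evaluation. The fundamental regular function is identity(), which returns the value of its argument. Because only the value matters, all of these statements are completely equivalent:

    identity(6)
    #> [1] 6
    identity(2 * 3)
    #> [1] 6
    a <- 2
    b <- 3
    identity(a * b)
    #> [1] 6

On the other hand, a quoting function sees the expression supplied as argument rather than the value of that expression. The expression might be evaluated later or might not be evaluated at all. The fundamental quoting function is quote(), which returns the expression of its argument:

    quote(6)
    #> [1] 6
    quote(2 * 3)
    #> 2 * 3
    quote(a * b)
    #> a * b

In fact, quoting is familiar to all programmers because this is how they create strings. The " character is a quoting operator: it signals that the supplied characters are not code but text. As an R programmer, you are probably also familiar with the formula operator ~. This quoting operator returns one or two quoted expressions. The three following expressions do similar things, they quote their input:

    "a * b"
    #> [1] "a * b"
    ~ a * b
    #> ~a * b
    quote(a * b)
    #> a * b

The first statement returns a quoted string; the other two return quoted code, in a formula or as a bare expression.

Changing the context of evaluation

A quoted expression can be evaluated with the function eval(). Let's quote an expression that subsets the first five lowercase letters and evaluate it:

    x <- quote(letters[1:5])
    x
    #> letters[1:5]
    eval(x)
    #> [1] "a" "b" "c" "d" "e"

Of course, this is not very impressive: you could have typed the expression normally to get this value. But one of the most important features of R is the ability to change the context of evaluation to get a different result. A context is also called an environment and is basically a set of symbols bound to values. A package namespace is one such context.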
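As a rough sketch of the ideas so far, using only base R: the helper names regular_fn and quoting_fn below are hypothetical and do not come from the original text. The regular function returns the value of its argument, while the quoting function captures the expression with substitute() and leaves it to the caller to decide whether to evaluate it.

```r
# Hypothetical helpers, for illustration only.
regular_fn <- function(x) x              # returns the value of its argument
quoting_fn <- function(x) substitute(x)  # returns the expression of its argument

a <- 2
b <- 3

value <- regular_fn(a * b)     # a number
captured <- quoting_fn(a * b)  # the unevaluated call a * b

print(value)
print(captured)

# A captured expression is only evaluated on request ...
print(eval(captured))

# ... and may never be evaluated at all: this stop() is captured, not run.
never_run <- quoting_fn(stop("this error is never raised"))
print(class(never_run))
```

Because quoting_fn() never touches the value of its argument, even an expression that would fail when evaluated can be passed to it safely.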
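For instance, in the context of the base namespace the symbol letters is bound to a character vector of lowercase letters. However, it could mean something different in another context. We could create a context where letters represents the uppercase letters in reverse order! Evaluating the quoted expression in that context returns a completely different result:

    context <- list(letters = rev(LETTERS))
    x
    #> letters[1:5]
    eval(x, context)
    #> [1] "Z" "Y" "X" "W" "V"

Interestingly, data frames can be used as evaluation contexts. In a data frame context, the column names represent vectors, so you can refer to those columns in your expressions:

    library(tibble)
    options(digits = 3)  # print three significant digits, as in the outputs below

    data1 <- tibble(mass = c(70, 80, 90), height = c(1.6, 1.7, 1.8))
    data2 <- tibble(mass = c(75, 85, 95), height = c(1.5, 1.7, 1.9))
    bmi_expr <- quote(mass / height^2)
    eval(bmi_expr, data1)
    #> [1] 27.3 27.7 27.8
    eval(bmi_expr, data2)
    #> [1] 33.3 29.4 26.3

In this last snippet we created an expression with quote() and evaluated it manually with eval(). However, quoting functions typically perform the quoting and the evaluation behind the scenes:

    with(data1, mass / height^2)
    #> [1] 27.3 27.7 27.8
    with(data2, mass / height^2)
    #> [1] 33.3 29.4 26.3

For this reason, quoting functions typically take a data frame as input in addition to user expressions, so that the expressions can be evaluated in the context of the data. This is a powerful feature that makes R a data-oriented programming language. Quoting functions are everywhere in R:

- with(data, expr) evaluates expr in the context of data.
- lm(formula, data) creates a design matrix with predictors evaluated in the context of data.
- mutate(data, new = expr) creates a new column from an expression evaluated in the context of data.
- ggplot(data, aes(expr)) defines the x aesthetic as the value of expr evaluated in the context of data.

In the context of the dplyr interface, quoting arguments has two benefits:

- Operations on data frames can be expressed succinctly because you don't need to repeat the name of the data frame. For example, you can write filter(df, x == 1, y == 2, z == 3) instead of df[df$x == 1 & df$y == 2 & df$z == 3, ].
- dplyr can choose to compute results in a different way from base R. This matters for database backends because dplyr itself doesn't do any work; instead it generates the SQL that tells the database what to do.

Unfortunately, the benefits of quoting functions do not come for free. While they simplify direct inputs, they make programmatic inputs harder. Quoting works for you when you use dplyr, but it works against you when your own functions use dplyr.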
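How a quoting function combines quoting and evaluation can be sketched in a few lines of base R. The function name my_with() is hypothetical; it is a simplified stand-in for with(), not its actual implementation, and the data frame is made up for illustration.

```r
# Hypothetical, simplified stand-in for with().
my_with <- function(data, expr) {
  quoted <- substitute(expr)          # quote the user's expression
  eval(quoted, data, parent.frame())  # evaluate it with the columns of data in scope
}

# Made-up data for illustration.
data1 <- data.frame(mass = c(70, 80, 90), height = c(1.6, 1.7, 1.8))

bmi_quoting <- my_with(data1, mass / height^2)
bmi_manual <- data1$mass / data1$height^2

print(bmi_quoting)
print(all.equal(bmi_quoting, bmi_manual))
```

The caller never repeats the name of the data frame inside the expression, which is the convenience described above. The price is that my_with() always quotes whatever is typed in its second argument.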
Varying quoted inputs

The issue of referential transparency has to do with the difficulty of passing contextual variables in order to vary the inputs of quoting functions. When you pass variables to quoting functions, they get quoted along with the rest of the expression.

To see the problem more clearly, let's define a simple quoting function that pastes its inputs as a string:

    library(rlang)

    cement <- function(..., .sep = " ") {
      # Quote the arguments, then turn each quoted input into a string
      strings <- lapply(exprs(...), as_string)
      paste(strings, collapse = .sep)
    }
    paste("it", "is", "rainy")
    #> [1] "it is rainy"
    cement(it, is, rainy)
    #> [1] "it is rainy"

Now what if we wanted to store the weather adjective in a variable? paste() has no issue on that front because it gets the value of the argument rather than its expression. On the other hand, if we pass a variable to cement(), it is quoted just like the other inputs and cement() never gets to see its contents:

    x <- "shiny"
    paste("it", "is", x)
    #> [1] "it is shiny"
    cement(it, is, x)
    #> [1] "it is x"

The solution to this problem is a special syntax that signals to the quoting function that part of the argument is to be unquoted, i.e., evaluated right away. The ability to mix quoting and evaluation is called quasiquotation and is the main tidy eval feature.

Quasiquotation

"Put simply, quasi-quotation enables one to introduce symbols that stand for a linguistic expression in a given instance and are used as that linguistic expression in a different instance."
Willard van Orman Quine

As we have seen, automatic quoting makes R and dplyr very convenient for interactive use but makes it difficult to refer to variable inputs. The solution to this problem is quasiquotation, which allows you to evaluate directly inside an expression that is otherwise quoted. Quasiquotation was coined by Willard van Orman Quine in the 1940s, and was adopted for programming by the LISP community in the 1970s. Quasiquotation is available (or will soon be) in all quoting functions of the tidyverse thanks to the tidy evaluation framework.

The bang! bang! operator

The tidy eval syntax for unquoting is !!. Anything supplied to this operator is evaluated right away and the result is substituted in place. Let's see !! in action in our cement() function:

    x <- "shiny"
    cement(it, is, !!x)
    #> [1] "it is shiny"

Even though the arguments are quoted, !!x signals that x should be evaluated right away. From cement()'s perspective, it's as if the user had typed "shiny" instead of !!x.
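For comparison, base R offers a more limited form of quasiquotation in bquote(), where .() plays a role similar to !!. The sketch below is an aside under that analogy and is not part of the tidy eval API; it only illustrates what "evaluated right away and substituted in place" means.

```r
x <- "shiny"

# quote() keeps the symbol x; bquote() substitutes the value wrapped in .()
fully_quoted <- quote(paste("it", "is", x))
partly_unquoted <- bquote(paste("it", "is", .(x)))

print(fully_quoted)
print(partly_unquoted)

# Change x afterwards: only the fully quoted call still depends on it.
x <- "rainy"
print(eval(fully_quoted))
print(eval(partly_unquoted))
```

The unquoted value was fixed at the moment the expression was built, so later changes to x no longer affect it.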
We have seen above that the fundamental quoting function in base R is quote(). In the tidyverse, it is expr(). All it does is quote its argument with quasiquotation support and return it right away:

    expr(x)
    #> x
    expr(!!x)
    #> [1] "shiny"

expr() is especially useful for debugging quasiquotation. You can wrap it around any expression in which you use !! to examine the effect of unquoting. Let's try it with cement():

    expr(cement(it, is, !!x))
    #> cement(it, is, "shiny")

This technique is essential for mastering tidy eval.

Creating symbols

Now that we are armed with quasiquotation, let's try to program with the dplyr verb mutate(). We'll take a BMI computation as a running example.

    library(dplyr)

    # Rescale height from centimetres to metres
    starwars <- mutate(starwars, height = height / 100)
    transmute(starwars, bmi = mass / height^2)
    #> # A tibble: 87 x 1
    #>     bmi
    #> 1  26.0
    #> 2  26.9
    #> 3  34.7
    #> 4  33.3
    #> # ... with 83 more rows

Let's say we want to vary the height input. A first intuition might be to store the column name in a variable and unquote it. But we get an error:

    x <- "height"
    transmute(starwars, bmi = mass / (!!x)^2)
    #> Error in mutate_impl(.data, dots): Evaluation error: non-numeric argument to binary operator.

The error message indicates a type error: a binary operator expected a numeric input but got something else. The error becomes clear if we use expr() to debug the unquoting:

    expr(transmute(starwars, bmi = mass / (!!x)^2))
    #> transmute(starwars, bmi = mass/("height")^2)

We are unquoting a string, and that is exactly what transmute() uses to evaluate the BMI. This can't work! We need to unquote something that looks like code instead of a string. What we are looking for is a symbol. A symbol is a string that references an object in a context. Symbols are the meat of R code. In foo(bar), foo is a symbol that references a function and bar is a symbol that references some object.

There are two ways of creating symbolic R code objects: by quotation or by construction. We already know how to create symbols by quoting.
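The difference between a string and a symbol when each is evaluated can be seen with base R alone. In this sketch the list named columns is a made-up stand-in for a data frame, and as.name() is the base R function for building a symbol from a string.

```r
# Made-up data for illustration.
columns <- list(height = c(1.6, 1.7, 1.8))

col_string <- "height"
col_symbol <- as.name(col_string)

# A string is a constant: evaluating it just returns the string.
from_string <- eval(col_string, columns)

# A symbol is looked up in the evaluation context.
from_symbol <- eval(col_symbol, columns)

print(from_string)
print(from_symbol)
print(is.symbol(col_symbol))
```

This is why unquoting a string inside an arithmetic expression fails: the operator receives the text itself rather than the column it names.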
Quoting does not help us much here, however, because we face the same issue again: a quoted symbol is a constant that can't be varied:

    quote(height)
    #> height
    expr(height)
    #> height

The other way is to build the symbol out of a string with the constructor sym(). Constructors are regular functions and can be programmed with variables:

    sym("height")
    #> height
    x <- "height"
    sym(x)
    #> height

Let's build a symbol and try to unquote it in the transmute() expression. Using expr() to examine the effect of unquoting, things are looking good:

    x <- sym("height")
    expr(transmute(starwars, bmi = mass / (!!x)^2))
    #> transmute(starwars, bmi = mass/(height)^2)

And indeed it now works!

    transmute(starwars, bmi = mass / (!!x)^2)
    #> # A tibble: 87 x 1
    #>     bmi
    #> 1  26.0
    #> 2  26.9
    #> 3  34.7
    #> 4  33.3
    #> # ... with 83 more rows
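The same pattern can be tried on a small made-up table instead of starwars. This is a sketch that assumes the dplyr and rlang packages are installed; the tibble people and the variable height_col are hypothetical names introduced only for this example.

```r
library(dplyr)
library(rlang)

# Made-up data for illustration.
people <- tibble(
  mass = c(70, 80, 90),
  height = c(1.6, 1.7, 1.8)
)

height_col <- sym("height")  # build a symbol from a string

# Inspect the call that transmute() effectively receives after unquoting.
print(expr(transmute(people, bmi = mass / (!!height_col)^2)))

bmi_varied <- transmute(people, bmi = mass / (!!height_col)^2)
bmi_direct <- transmute(people, bmi = mass / height^2)

print(all.equal(bmi_varied$bmi, bmi_direct$bmi))
```

Changing the string passed to sym() is enough to point the computation at another column, which is what makes the input programmable.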
Creating a wrapper around a dplyr pipeline

Quasiquotation is all we need to write our first wrapper function around a dplyr pipeline. The goal is to write reliable functions that reduce duplication in our data analysis code. Let's say that we often take a grouped average with dplyr, and our scripts are littered with little pipelines that look like this:

    starwars %>%
      group_by(species) %>%
      summarise(avg = mean(height))
    #> # A tibble: 38 x 2
    #>   species    avg
    #> 1 Aleena    0.79
    #> 2 Besalisk  1.98
    #> 3 Cerean    1.98
    #> 4 Chagrian  1.96
    #> # ... with 34 more rows

It would be a good idea to extract this logic into a function. This would reduce the risk of typos and make our code more concise, and clearer too if we choose a good name for the function.

We know from the previous sections that a naive wrapper like the following will not work. The variable names are automatically quoted. The column names they contain are at best ignored (in the example below, group_by() looks for a column named group) or at worst misused (summarise() would try to take the average of the string "height"):

    mean_by <- function(data, var, group) {
      data %>%
        group_by(group) %>%
        summarise(avg = mean(var))
    }
    mean_by(starwars, "height", "species")
    #> Error in grouped_df_impl(data, unname(vars), drop): Column `group` is unknown

Our wrapper simply needs to construct symbols from its inputs and unquote them in the pipeline:

    mean_by <- function(data, var, group) {
      # Turn the column names (strings) into symbols
      var <- sym(var)
      group <- sym(group)
      data %>%
        group_by(!!group) %>%
        summarise(avg = mean(!!var))
    }
    mean_by(starwars, "height", "species")
    #> # A tibble: 38 x 2
    #>   species    avg
    #> 1 Aleena    0.79
    #> 2 Besalisk  1.98
    #> 3 Cerean    1.98
    #> 4 Chagrian  1.96
    #> # ... with 34 more rows
    mean_by(starwars, "mass", "eye_color")
    #> # A tibble: 15 x 2
    #>   eye_color   avg
    #> 1 black        NA
    #> 2 blue         NA
    #> 3 blue-gray    77
    #> 4 brown        NA
    #> # ... with 11 more rows

Creating your own quoting functions

The wrapper that we just created is a regular function. It takes strings and doesn't quote any of its inputs. This has the advantage t