r - Combine select and mutate

Tags: r, dplyr

Quite often, I find myself manually combining the select() and mutate() functions from dplyr. This is usually because I'm tidying up a data frame, want to create new columns based on the old ones, and only want to keep the new columns.

For example, if I had data about heights and widths but only wanted to use them to calculate and keep the area then I would use:

library(dplyr)
df <- data.frame(height = 1:3, width = 10:12)

df %>% 
  mutate(area = height * width) %>% 
  select(area)

When there are a lot of variables being created in the mutate step it can be difficult to make sure they're all in the select step. Is there a more elegant way to only keep the variables defined in the mutate step?

One workaround I've been using is the following:

df %>%
  mutate(id = row_number()) %>%
  group_by(id) %>%
  summarise(area = height * width) %>%
  ungroup() %>%
  select(-id)

This works but is pretty verbose, and the use of summarise() means there's a performance hit:

library(microbenchmark)

microbenchmark(

  df %>% 
    mutate(area = height * width) %>% 
    select(area),

  df %>%
    mutate(id = row_number()) %>%
    group_by(id) %>%
    summarise(area = height * width) %>%
    ungroup() %>%
    select(-id)
)

Output:

      min       lq     mean   median       uq      max neval cld
  868.822  954.053 1258.328 1147.050 1363.251 4369.544   100  a 
 1897.396 1958.754 2319.545 2247.022 2549.124 4025.050   100   b

I'm thinking there's another workaround where you can compare the original dataframe names with the new dataframe names and take the right complement, but maybe there's a better way?
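The name-comparison idea mentioned above could look roughly like the following. This is only a sketch: the `perimeter` column and the `mutated`, `new_cols`, and `result` names are made up for illustration, and `all_of()` assumes a reasonably recent dplyr/tidyselect.

```r
library(dplyr)

df <- data.frame(height = 1:3, width = 10:12)

# Run the mutate step on its own and keep the full result.
mutated <- df %>%
  mutate(area = height * width, perimeter = 2 * (height + width))

# Column names present after mutate() but not before it.
new_cols <- setdiff(names(mutated), names(df))

# Keep only those newly added columns.
result <- mutated %>% select(all_of(new_cols))
```

One caveat with this approach: a column that mutate() overwrites under its existing name (say `height = height * 2`) is not in the set difference, so it would be dropped.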

I feel like I'm missing something really obvious in the dplyr documentation, so apologies if this is trivial!

Answer

Just create your own function that combines the two steps:

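# Keep only the columns created by the mutate() step.
mutate_only <- function(.data, ...) {
    # Names of the arguments passed through `...`, i.e. the new columns.
    new_names <- names(match.call(expand.dots = FALSE)$...)
    .data %>% mutate(...) %>% select(one_of(new_names))
}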

This needs some work to function properly with standard evaluation. Unfortunately the dplyr API is currently evolving on that point so I don’t know what the recommendation for this will be in a few weeks’ time. Therefore I’ll just refer to the relevant documentation.
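As an aside that is not part of the answer above: dplyr also ships transmute(), which computes new columns like mutate() but drops the existing ones, so it may cover the simple case without a custom wrapper. A minimal sketch on the data frame from the question (the `only_new` and `two_step` names are just for illustration):

```r
library(dplyr)

df <- data.frame(height = 1:3, width = 10:12)

# transmute() computes new columns like mutate() but drops the existing ones.
only_new <- df %>% transmute(area = height * width)

# The two-step version from the question, for comparison.
two_step <- df %>%
  mutate(area = height * width) %>%
  select(area)
```

Both objects should hold a single `area` column. As far as I know, dplyr 1.0.0 and later also accept `mutate(.keep = "none")` for the same purpose, so check which form your installed version recommends.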
