问题描述:

The following problem prevents me so far from a really flexible usage of data.table aggregations.

Example:

library(data.table)

set.seed(1)

DT <- data.table(C1=c("a","b","b"),

C2=round(rnorm(4),4),

C3=1:12,

C4=9:12)

sum_cols <- c("C2","C3")

#I want to apply a custom aggregation over multiple columns

DT[,lapply(.SD,sum),by=C1,.SDcols=sum_cols]

### Part 1 of question ###

#but what if I want to add another aggregation, e.g. count

DT[,.N,by=C1]

#this is not working as intended (creates 4 rows instead of 2 and doesnt contain sum_cols)

DT[,.(.N,lapply(.SD,sum)),by=C1,.SDcols=sum_cols]

### Part 2 of question ###

# or another function for another set of colums and adding a prefix to keep them appart?

mean_cols <- c("C3","C4")

#intended table structure (with 2 rows again)

C1 sum_C2 sum_C3 mean_C3 mean_C4

I know I can always merge various single aggregation results by some key but I`m sure there must be a correct, flexible and easy way to do what I would like to do (especially Part 2).

网友答案:

The first thing to notice is that data.table's j argument expects a list output, which can be built with c, as mentioned in @akrun's answer. Here are two ways to do it:

set.seed(1)
DT <- data.table(C1=c("a","b","b"), C2=round(rnorm(4),4), C3=1:12, C4=9:12)
sum_cols <- c("C2","C3")
mean_cols <- c("C3","C4")

# with the development version, 1.10.1+
DT[, c(
  .N, 
  sum = lapply(.SD[, ..sum_cols], sum), 
  mean = lapply(.SD[, ..mean_cols], mean)
), by=C1]

# in earlier versions
DT[, c(
  .N, 
  sum = lapply(.SD[, sum_cols, with=FALSE], sum), 
  mean = lapply(.SD[, mean_cols, with=FALSE], mean)
), by=C1]

mget returns a list and c connects elements together to make a list.


Comments

If you turn on the verbose data.table option for these calls, you'll see a message:

The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

Also, you'll see that the optimized group mean and sum are not being used (see ?GForce for details). We can get around this by following FAQ 1.6 perhaps, but I couldn't figure out how.

网友答案:

The output are lists, so we use c to concatenate both the list outputs

DT[,c(.N,lapply(.SD,sum)),by=C1,.SDcols=sum_cols]
#    C1 N    C2 C3
# 1:  a 4 0.288 22
# 2:  b 8 0.576 56
相关阅读:
Top