Power Tools in R: ddply and aggregate

Source: Internet  Date: 2016-11-05

ddply and aggregate are two powerful functions for aggregating data by group.

aggregate(x, ...)

The aggregate() function is briefly described on p. 105 of 《R语言实战》 (R in Action); this section revisits it. Its main usages are:

 ## Default S3 method:

aggregate(x, ...)

## S3 method for class 'data.frame'

aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)

## S3 method for class 'formula'

aggregate(formula, data, FUN, ..., subset, na.action = na.omit)

## S3 method for class 'ts'

aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)

 

 


Example:

attach(mtcars)

aggdata <- aggregate(mtcars, by = list(cyl, gear), FUN = mean, na.rm = TRUE)

aggdata

  Group.1 Group.2    mpg cyl     disp       hp     drat       wt    qsec  vs   am gear     carb
1       4       3 21.500   4 120.1000  97.0000 3.700000 2.465000 20.0100 1.0 0.00    3 1.000000
2       6       3 19.750   6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00    3 1.000000
3       8       3 15.050   8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00    3 3.083333
4       4       4 26.925   4 102.6250  76.0000 4.110000 2.378125 19.6125 1.0 0.75    4 1.500000
5       6       4 19.750   6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50    4 4.000000
6       4       5 28.200   4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00    5 2.000000
7       6       5 19.700   6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00    5 6.000000
8       8       5 15.400   8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00    5 6.000000

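The result is the data frame aggdata. The default column names Group.1 and Group.2 can be replaced by naming the elements of the by list, that is, by writing the second line as:

aggdata <- aggregate(mtcars, by = list(Group.cyl = cyl, Group.gears = gear), FUN = mean, na.rm = TRUE)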

Note: when using aggregate(), the variables passed to by must be wrapped in a list, even if there is only one. FUN can be any built-in or user-defined function.
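The two points in the note can be illustrated with a minimal sketch on the built-in mtcars data. The helper name value_range is hypothetical and exists only for this illustration.

```r
# A single grouping variable still has to be wrapped in list().
by_cyl <- aggregate(mtcars["mpg"], by = list(cyl = mtcars$cyl), FUN = mean)

# FUN may also be a user-defined function.
# value_range is a hypothetical helper: spread between largest and smallest value.
value_range <- function(v) max(v) - min(v)
range_by_cyl <- aggregate(mtcars["mpg"], by = list(cyl = mtcars$cyl), FUN = value_range)

print(by_cyl)
print(range_by_cyl)
```

Naming the list element (cyl = ...) gives the grouping column a readable name instead of Group.1.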

Some other examples:

## Compute the averages for the variables in 'state.x77', grouped

## according to the region (Northeast, South, North Central, West) that

## each state belongs to.

aggregate(state.x77, list(Region = state.region), mean)

## Compute the averages according to region and the occurrence of more

## than 130 days of frost.

aggregate(state.x77,
          list(Region = state.region, Cold = state.x77[, "Frost"] > 130),
          mean)

## (Note that no state in 'South' is THAT cold.)

## example with character variables and NAs

testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99))

by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)

by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)

aggregate(x = testDF, by = list(by1, by2), FUN = "mean")

# and if you want to treat NAs as a group

fby1 <- factor(by1, exclude = "")

fby2 <- factor(by2, exclude = "")

aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")

## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:

aggregate(weight ~ feed, data = chickwts, mean)

aggregate(breaks ~ wool + tension, data = warpbreaks, mean)

aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)

## Dot notation:

aggregate(. ~ Species, data = iris, mean)

aggregate(len ~ ., data = ToothGrowth, mean)

## Often followed by xtabs():

ag <- aggregate(len ~ ., data = ToothGrowth, mean)

xtabs(len ~ ., data = ag)

## Compute the average annual approval ratings for American presidents.

aggregate(presidents, nfrequency = 1, FUN = mean)

## Give the summer less weight.

aggregate(presidents, nfrequency = 1,
          FUN = weighted.mean, w = c(1, 1, 0.5, 1))

ddply(.data, .variables, ...)

The general usage of the ddply function (from the plyr package) is:

ddply(.data, .variables, .fun = NULL, ..., .progress = "none", .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)
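To make the roles of the first three arguments concrete, the following base-R sketch imitates the split-apply-combine flow: split the data frame by a grouping variable, apply a function to each piece, and stack the results. It is a simplified illustration under stated assumptions, not plyr's actual implementation; simple_ddply and its parameters are hypothetical names, and it handles only one grouping column.

```r
# Hypothetical helper: a stripped-down imitation of the split-apply-combine idea.
simple_ddply <- function(data, group_col, fun) {
  # split: one data frame per distinct value of the grouping column
  pieces <- split(data, data[[group_col]])
  # apply: run fun on each piece and prepend the group label
  results <- lapply(names(pieces), function(key) {
    out <- fun(pieces[[key]])
    label <- setNames(data.frame(key, stringsAsFactors = FALSE), group_col)
    cbind(label, out)
  })
  # combine: stack the per-group results into one data frame
  do.call(rbind, results)
}

mpg_by_cyl <- simple_ddply(mtcars, "cyl", function(piece) {
  data.frame(n = nrow(piece), mean_mpg = mean(piece$mpg))
})
print(mpg_by_cyl)
```

In this sketch the group label comes back as a character column, whereas ddply keeps the original type of the grouping variables.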

Example:

# Load plyr, which provides ddply and the baseball data set used below

library(plyr)

# Summarize a dataset by two variables

dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

head(dfx)

  group sex      age
1     A   M 22.44750
2     A   M 52.92616
3     A   F 30.00443
4     A   M 39.56907
5     A   M 18.89180
6     A   F 50.81139

# Note the use of the '.' function to allow
# group and sex to be used without quoting

ddply(dfx, .(group, sex), summarize, mean = round(mean(age), 2), sd = round(sd(age), 2))

  group sex  mean    sd
1     A   F 40.41 14.71
2     A   M 30.35 13.17
3     B   F 34.81 12.76
4     B   M 34.04 13.36
5     C   F 35.09 13.39
6     C   M 28.53  4.57
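For comparison, roughly the same summary can be obtained with aggregate() alone. The sketch below rebuilds dfx with an assumed seed (set.seed(1) is not in the original, so the numbers will differ from the output above) and computes the mean and standard deviation in two calls that are then merged; the names age_mean, age_sd and age_summary are chosen for this illustration.

```r
set.seed(1)  # assumed seed, only to make this sketch reproducible
dfx <- data.frame(
  group = c(rep("A", 8), rep("B", 15), rep("C", 6)),
  sex   = sample(c("M", "F"), size = 29, replace = TRUE),
  age   = runif(n = 29, min = 18, max = 54)
)

# One aggregate() call per statistic, using the formula interface
age_mean <- aggregate(age ~ group + sex, data = dfx,
                      FUN = function(a) round(mean(a), 2))
age_sd   <- aggregate(age ~ group + sex, data = dfx,
                      FUN = function(a) round(sd(a), 2))

# Join the two results on the grouping columns
age_summary <- merge(age_mean, age_sd, by = c("group", "sex"),
                     suffixes = c("_mean", "_sd"))
print(age_summary)
```

With ddply and summarize, both statistics come out of a single call, which is the main convenience it offers here.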

# An example using a formula for .variables

ddply(baseball[1:100,], ~ year, nrow)

  year V1
1 1871  7
2 1872 13
3 1873 13
4 1874 15
5 1875 17
6 1876 15
7 1877 17
8 1878  3

# Applying two functions; nrow and ncol

ddply(baseball, .(lg), c("nrow", "ncol"))

  lg  nrow ncol
1       65   22
2 AA   171   22
3 AL 10007   22
4 FL    37   22
5 NL 11378   22
6 PL    32   22
7 UA     9   22

# Calculate mean runs batted in for each year

rbi <- ddply(baseball, .(year), summarise, mean_rbi = mean(rbi, na.rm = TRUE))

head(rbi)

  year mean_rbi
1 1871 22.28571
2 1872 20.53846
3 1873 30.92308
4 1874 29.00000
5 1875 31.58824
6 1876 30.13333

# Plot a line chart of the result

plot(mean_rbi ~ year, type = "l", data = rbi)

# Make new variable career_year based on the
# start year for each player (id)

base2 <- ddply(baseball, .(id), mutate, career_year = year - min(year) + 1)

head(base2)

         id year stint team lg   g  ab   r   h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp career_year
1 aaronha01 1954     1  ML1 NL 122 468  58 131  27   6 13  69  2  2 28 39  NA   3  6  4   13           1
2 aaronha01 1955     1  ML1 NL 153 602 105 189  37   9 27 106  3  1 49 61   5   3  7  4   20           2
3 aaronha01 1956     1  ML1 NL 153 609 106 200  34  14 26  92  2  4 37 54   6   2  5  7   21           3
4 aaronha01 1957     1  ML1 NL 151 615 118 198  27   6 44 132  1  1 57 58  15   0  0  3   13           4
5 aaronha01 1958     1  ML1 NL 153 601 109 196  34   4 30  95  4  1 59 49  16   1  0  3   21           5
6 aaronha01 1959     1  ML1 NL 154 629 116 223  46   7 39 123  8  0 51 54  17   4  0  9   19           6
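The same per-group computation can be sketched in base R with ave(), which returns a group-wise statistic aligned with the original rows. The players data frame below is made up for illustration and is not taken from the baseball data set.

```r
# Hypothetical toy data: two players with the seasons they played
players <- data.frame(
  id   = c("p1", "p1", "p1", "p2", "p2"),
  year = c(1954, 1955, 1956, 1960, 1962)
)

# ave() computes min(year) within each id and repeats it for every row of that id
first_year <- ave(players$year, players$id, FUN = min)

# Career year counts from 1 in each player's first season
players$career_year <- players$year - first_year + 1
print(players)
```

Unlike aggregate(), which returns one row per group, ave() keeps one value per input row, which matches what mutate does inside ddply.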

 
