# Notes on the R data-processing packages dplyr and tidyr

• Filter rows: filter()
• Sort rows: arrange()
• Select columns: select()
• Add or transform columns: mutate()
• Summarise: summarise()
• Group: group_by()

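• gather: convert wide data to long data;
• spread: convert long data to wide data;
• unite: combine several columns into one;
• separate: split one column into several;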

### Installing and loading dplyr and tidyr

```r
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
```

```r
# tbl_df() is deprecated in current dplyr releases; as_tibble() is the replacement
mtcars_df <- as_tibble(mtcars)
```

### Basic dplyr operations

#### 1.1 Filter: filter()

```r
filter(mtcars_df, mpg == 21, hp == 110)
#> # A tibble: 2 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21     6   160   110   3.9 2.620 16.46     0     1     4     4
#> 2    21     6   160   110   3.9 2.875 17.02     0     1     4     4
```
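Conditions separated by commas in filter() are combined with AND. The sketch below spells that out and also shows an OR condition; the result names (`both`, `same`, `either`, `either_in`) are illustrative only and do not appear in the original notes.

```r
library(dplyr)

mtcars_df <- as_tibble(mtcars)

# Comma-separated conditions are combined with AND ...
both <- filter(mtcars_df, mpg == 21, hp == 110)
# ... so this is the same filter written with an explicit &
same <- filter(mtcars_df, mpg == 21 & hp == 110)

# OR has to be written explicitly, either with | or with %in%
either    <- filter(mtcars_df, cyl == 4 | cyl == 6)
either_in <- filter(mtcars_df, cyl %in% c(4, 6))

nrow(both)
nrow(either)
```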

#### 1.2 Sort: arrange()

```r
arrange(mtcars_df, disp) # wrap the column in desc(), e.g. desc(disp), to sort in descending order
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#>  2  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#>  3  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#>  4  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
#>  5  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#>  6  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#>  7  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
#>  8  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
#>  9  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2
#> 10  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> # ... with 22 more rows
```
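As a minimal sketch of the desc() option mentioned in the comment above, the following sorts by descending displacement and then by two keys at once. The variable names `by_disp_desc` and `by_cyl_mpg` are made up for this example.

```r
library(dplyr)

mtcars_df <- as_tibble(mtcars)

# Largest displacement first
by_disp_desc <- arrange(mtcars_df, desc(disp))

# Sort by cyl ascending, then by mpg descending within each cyl value
by_cyl_mpg <- arrange(mtcars_df, cyl, desc(mpg))

head(by_disp_desc, 3)
head(by_cyl_mpg, 3)
```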

#### 1.3 Select: select()

```r
select(mtcars_df, disp:wt)
#> # A tibble: 32 x 4
#>     disp    hp  drat    wt
#> *  <dbl> <dbl> <dbl> <dbl>
#>  1 160.0   110  3.90 2.620
#>  2 160.0   110  3.90 2.875
#>  3 108.0    93  3.85 2.320
#>  4 258.0   110  3.08 3.215
#>  5 360.0   175  3.15 3.440
#>  6 225.0   105  2.76 3.460
#>  7 360.0   245  3.21 3.570
#>  8 146.7    62  3.69 3.190
#>  9 140.8    95  3.92 3.150
#> 10 167.6   123  3.92 3.440
#> # ... with 22 more rows
```

#### 1.4 Add or transform columns: mutate()

```r
mutate(mtcars_df, NO = 1:dim(mtcars_df)[1])
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    NO
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#>  1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4     1
#>  2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4     2
#>  3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1     3
#>  4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1     4
#>  5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2     5
#>  6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1     6
#>  7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4     7
#>  8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2     8
#>  9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2     9
#> 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4    10
#> # ... with 22 more rows
```

#### 1.5 Summarise: summarise()

```r
summarise(mtcars_df, mdisp = mean(disp, na.rm = TRUE))
#> # A tibble: 1 x 1
#>      mdisp
#>      <dbl>
#> 1 230.7219
```
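summarise() can compute several statistics in one call, each becoming a column of the one-row result. A small sketch follows; the column names `n_rows` and `max_hp` are chosen here for illustration and are not from the original notes.

```r
library(dplyr)

mtcars_df <- as_tibble(mtcars)

# Each named argument becomes one column of the single-row result
overall <- summarise(
  mtcars_df,
  n_rows = n(),                       # number of rows
  mdisp  = mean(disp, na.rm = TRUE),  # mean displacement, ignoring NA
  max_hp = max(hp)                    # largest horsepower
)

overall
```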

#### 1.6 Group: group_by()

```r
cars <- group_by(mtcars_df, cyl)
countcars <- summarise(cars, count = n()) # count = n() counts the rows in each group
countcars
#> # A tibble: 3 x 2
#>     cyl count
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14
```
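The grouping and the summary are often chained with the pipe operator `%>%`, which these notes also use in the tidyr section. The sketch below is one possible way to write the same count, plus a per-group mean; `cyl_summary`, `cyl_counts`, and `mean_mpg` are names invented for this example.

```r
library(dplyr)

# group_by() then summarise() in one pipeline: one output row per cyl value
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n(), mean_mpg = mean(mpg))

# count() is a shorthand for group_by() + summarise(n = n())
cyl_counts <- count(mtcars, cyl)

cyl_summary
cyl_counts
```

Note that count() names its result column `n` by default, rather than `count`.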

### Basic tidyr operations

#### 2.1 Wide to long: gather()

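```r
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
```

• data: the wide table to convert
• key: name of the new column that will hold the original column names
• value: name of the new column that will hold the original cell values
• ...: the columns to gather into the key/value pair
• na.rm: whether to drop rows whose value is missing

```r
widedata <- data.frame(person = c('Alex', 'Bob', 'Cathy'),
                       grade = c(2, 3, 4),
                       score = c(78, 89, 88))
widedata
#>   person grade score
#> 1   Alex     2    78
#> 2    Bob     3    89
#> 3  Cathy     4    88

longdata <- gather(widedata, variable, value, -person)
longdata
#>   person variable value
#> 1   Alex    grade     2
#> 2    Bob    grade     3
#> 3  Cathy    grade     4
#> 4   Alex    score    78
#> 5    Bob    score    89
#> 6  Cathy    score    88
```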
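The `...` argument can also name the columns to gather explicitly instead of excluding `person`. The sketch below shows that form and, for comparison, the same reshape with pivot_longer(), which tidyr 1.0.0 and later offer as the successor to gather(). The names `long_gather` and `long_pivot` are illustrative only.

```r
library(tidyr)

widedata <- data.frame(
  person = c("Alex", "Bob", "Cathy"),
  grade  = c(2, 3, 4),
  score  = c(78, 89, 88)
)

# Name the columns to gather instead of excluding person with -person
long_gather <- gather(widedata, variable, value, grade, score)

# The same reshape with pivot_longer() (requires tidyr >= 1.0.0)
long_pivot <- pivot_longer(
  widedata,
  cols      = c(grade, score),
  names_to  = "variable",
  values_to = "value"
)

long_gather
long_pivot
```

Both results have six rows, but the row order differs: gather() stacks one source column after another, while pivot_longer() keeps the rows of each person together.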

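#### 2.2 Long to wide: spread()

```r
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
```

• data: the long table to convert
• key: the column whose values become the new column names
• value: the column whose values fill the new columns
• fill: the value to use where a key/value combination is missing after reshaping

```r
# mtcarsNew is not defined in these notes; judging by the output, it is a
# long-format table with the columns car, attribute and value.
mtcarsSpread <- mtcarsNew %>% spread(attribute, value)
head(mtcarsSpread)
#>                  car am carb cyl disp drat gear  hp  mpg  qsec vs    wt
#> 1        AMC Javelin  0    2   8  304 3.15    3 150 15.2 17.30  0 3.435
#> 2 Cadillac Fleetwood  0    4   8  472 2.93    3 205 10.4 17.98  0 5.250
#> 3         Camaro Z28  0    4   8  350 3.73    3 245 13.3 15.41  0 3.840
#> 4  Chrysler Imperial  0    4   8  440 3.23    3 230 14.7 17.42  0 5.345
#> 5         Datsun 710  1    1   4  108 3.85    4  93 22.8 18.61  1 2.320
#> 6   Dodge Challenger  0    2   8  318 2.76    3 150 15.5 16.87  0 3.520
```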
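The notes never show how `mtcarsNew` was built. As a self-contained sketch, the code below makes one plausible assumption: that it is `mtcars` with the row names moved into a `car` column and every other column gathered into `attribute`/`value` pairs. The names `mtcars_named`, `mtcars_long`, and `mtcars_wide` are hypothetical and stand in for the original objects.

```r
library(tidyr)

# Assumed reconstruction: move the row names into a regular column ...
mtcars_named <- data.frame(car = rownames(mtcars), mtcars,
                           row.names = NULL, stringsAsFactors = FALSE)

# ... then gather every column except car into attribute/value pairs
mtcars_long <- gather(mtcars_named, attribute, value, -car)

# spread() reverses the gather: one column per distinct attribute
mtcars_wide <- spread(mtcars_long, attribute, value)

head(mtcars_wide)
```

After the round trip the attribute columns come back in alphabetical order and the rows are sorted by `car`, which matches the layout of the output shown above.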

#### 2.3 Combine: unite()

unite() is called as follows:

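```r
unite(data, col, ..., sep = "_", remove = TRUE)
```

• data: the data frame
• col: name of the new combined column
• ...: the columns to combine
• sep: separator placed between the combined values (default: underscore)
• remove: whether to drop the columns that were combined

```r
wideunite <- unite(widedata, information, person, grade, score, sep = "-")
wideunite
#>   information
#> 1   Alex-2-78
#> 2    Bob-3-89
#> 3  Cathy-4-88
```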
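To illustrate the `remove` argument, here is a minimal sketch that keeps the source columns next to the combined one. The name `kept` is made up for this example.

```r
library(tidyr)

widedata <- data.frame(
  person = c("Alex", "Bob", "Cathy"),
  grade  = c(2, 3, 4),
  score  = c(78, 89, 88)
)

# remove = FALSE keeps person, grade and score alongside the new column
kept <- unite(widedata, information, person, grade, score,
              sep = "-", remove = FALSE)

kept
```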

#### 2.4 Split: separate()

separate() splits one column into several. It is typically useful for log data or date-time data. The syntax is:

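```r
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE,
         convert = FALSE, extra = "warn", fill = "warn", ...)
```

• data: the data frame
• col: the column to split
• into: names of the new columns, as a character vector
• sep: the separator used in the column being split
• remove: whether to drop the column that was split

```r
widesep <- separate(wideunite, information,
                    c("person", "grade", "score"), sep = "-")
widesep
#>   person grade score
#> 1   Alex     2    78
#> 2    Bob     3    89
#> 3  Cathy     4    88
```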
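One detail the printed output hides: with the default `convert = FALSE`, every piece produced by separate() stays a character string, so `grade` and `score` come back as text. The sketch below contrasts the default with `convert = TRUE`; the names `as_text` and `as_typed` are illustrative only.

```r
library(tidyr)

wideunite <- data.frame(
  information = c("Alex-2-78", "Bob-3-89", "Cathy-4-88"),
  stringsAsFactors = FALSE
)

# Default: all new columns are character vectors
as_text <- separate(wideunite, information,
                    c("person", "grade", "score"), sep = "-")

# convert = TRUE lets tidyr guess a better type for each new column
as_typed <- separate(wideunite, information,
                     c("person", "grade", "score"), sep = "-",
                     convert = TRUE)

str(as_text)
str(as_typed)
```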
