# R: Common Statistical Tests

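The basic steps of a hypothesis test (as listed on Baidu Baike):

1. State the hypotheses
2. Determine the sampling distribution
3. Choose the significance level and the rejection region
4. Compute the test statistic
5. Make a decision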
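The five steps can be traced by hand for a one-sample t-test. The following is a minimal sketch, assuming the component-lifetime data and the hypothesized mean of 225 hours used in an example later in this article; the variable names (`mu0`, `t_crit`, `t_stat`, and so on) are illustrative only.

```r
# Component lifetimes (hours), taken from the example later in this article
X <- c(159, 280, 101, 212, 224, 379, 179, 264,
       222, 362, 168, 250, 149, 260, 485, 170)

# 1. Hypotheses: H0: mu = 225 versus H1: mu > 225
mu0 <- 225

# 2. Sampling distribution: under H0 the t statistic follows t(n - 1)
n  <- length(X)
df <- n - 1

# 3. Significance level and rejection region: reject H0 when t > t_crit
alpha  <- 0.05
t_crit <- qt(1 - alpha, df)

# 4. Test statistic
t_stat <- (mean(X) - mu0) / (sd(X) / sqrt(n))

# 5. Decision (the one-sided p-value is reported alongside)
p_value <- pt(t_stat, df, lower.tail = FALSE)
reject  <- t_stat > t_crit
cat(sprintf("t = %.4f, critical value = %.4f, p = %.4f, reject H0: %s\n",
            t_stat, t_crit, p_value, reject))
```

In practice `t.test(X, alternative = "greater", mu = 225)` performs steps 2 to 5 in one call; the hand computation only makes the individual steps visible.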

## Hypothesis tests for the mean of a normal population

### t-test

`t.test()` => Student's t-Test

```r
require(graphics)

t.test(1:10, y = c(7:20))      # P = .00001855
t.test(1:10, y = c(7:20, 200)) # P = .1245 -- no longer significant
```

```r
## Classical example: Student's sleep data
plot(extra ~ group, data = sleep)
```

```r
## Traditional interface
with(sleep, t.test(extra[group == 1], extra[group == 2]))
#>
#>     Welch Two Sample t-test
#>
#> data:  extra[group == 1] and extra[group == 2]
#> t = -1.8608, df = 17.776, p-value = 0.07939
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -3.3654832  0.2054832
#> sample estimates:
#> mean of x mean of y
#>      0.75      2.33

## Formula interface
t.test(extra ~ group, data = sleep)
#>
#>     Welch Two Sample t-test
#>
#> data:  extra by group
#> t = -1.8608, df = 17.776, p-value = 0.07939
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -3.3654832  0.2054832
#> sample estimates:
#> mean in group 1 mean in group 2
#>            0.75            2.33
```

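• The lifetime X (in hours) of a certain component follows a normal distribution N(mu, sigma^2), where both mu and sigma^2 are unknown. The lifetimes of 16 components are given below. Is there reason to believe that the mean lifetime exceeds 225 hours?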

```r
X <- c(159, 280, 101, 212, 224, 379, 179, 264,
       222, 362, 168, 250, 149, 260, 485, 170)
t.test(X, alternative = "greater", mu = 225)
#>
#>     One Sample t-test
#>
#> data:  X
#> t = 0.66852, df = 15, p-value = 0.257
#> alternative hypothesis: true mean is greater than 225
#> 95 percent confidence interval:
#>  198.2321      Inf
#> sample estimates:
#> mean of x
#>     241.5
```

• X is the yield of the old steelmaking furnace and Y is the yield of the new one. Does the new operating procedure improve the yield?

```r
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)
t.test(X, Y, var.equal = TRUE, alternative = "less")
#>
#>     Two Sample t-test
#>
#> data:  X and Y
#> t = -4.2957, df = 18, p-value = 0.0002176
#> alternative hypothesis: true difference in means is less than 0
#> 95 percent confidence interval:
#>       -Inf -1.908255
#> sample estimates:
#> mean of x mean of y
#>     76.23     79.43
```

• Paired t-test, treating the two measurements on each furnace as a pair

```r
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)
t.test(X - Y, alternative = "less")
#>
#>     One Sample t-test
#>
#> data:  X - Y
#> t = -4.2018, df = 9, p-value = 0.00115
#> alternative hypothesis: true mean is less than 0
#> 95 percent confidence interval:
#>       -Inf -1.803943
#> sample estimates:
#> mean of x
#>      -3.2
```
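A paired t-test is a one-sample t-test on the within-pair differences, which is why `t.test(X - Y, ...)` works here. As a small sketch, the same test can also be requested through the `paired = TRUE` argument of `t.test()`; the object names `by_difference` and `paired` are illustrative.

```r
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)

# One-sample test on the differences, as in the example above
by_difference <- t.test(X - Y, alternative = "less")

# Equivalent call using the paired interface
paired <- t.test(X, Y, paired = TRUE, alternative = "less")

# Both approaches give the same statistic and p-value
c(by_difference = unname(by_difference$statistic),
  paired        = unname(paired$statistic))
c(by_difference = by_difference$p.value,
  paired        = paired$p.value)
```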

## Hypothesis tests for the variance of a normal population

`var.test()` => F Test to Compare Two Variances

```r
x <- rnorm(50, mean = 0, sd = 2)
y <- rnorm(30, mean = 1, sd = 1)
var.test(x, y)                  # Do x and y have the same variance?
var.test(lm(x ~ 1), lm(y ~ 1))  # The same.
```

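• Twenty boys are sampled from fifth-grade primary school pupils and their heights (in cm) are measured as below. At the 0.05 significance level, is the mean equal to 149, and is sigma^2 equal to 75?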

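```r
X <- c(136, 144, 143, 157, 137, 159, 135, 158, 147, 165,
       158, 142, 159, 150, 156, 152, 140, 149, 148, 155)
# Y is the new-furnace yield vector from the t-test examples above
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)
var.test(X, Y)
#>
#>     F test to compare two variances
#>
#> data:  X and Y
#> F = 34.945, num df = 19, denom df = 9, p-value = 6.721e-06
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#>    9.487287 100.643093
#> sample estimates:
#> ratio of variances
#>           34.94489
```

Note that `var.test()` is a two-sample F test: this call compares the variance of the heights with the variance of the furnace yields `Y`, so on its own it does not test mu = 149 or sigma^2 = 75.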
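The question about a single mean and a single variance can be approached with one-sample tests. Base R's `var.test()` only compares two samples, so the sketch below computes the usual chi-squared statistic (n - 1) S^2 / sigma0^2 by hand. Doubling the smaller tail probability is one common convention for a two-sided p-value and is an assumption here, as are the variable names.

```r
X <- c(136, 144, 143, 157, 137, 159, 135, 158, 147, 165,
       158, 142, 159, 150, 156, 152, 140, 149, 148, 155)
n <- length(X)

# Mean: H0: mu = 149 (two-sided one-sample t-test)
mean_test <- t.test(X, mu = 149)

# Variance: H0: sigma^2 = 75
# Under H0, (n - 1) * S^2 / sigma0^2 follows a chi-squared(n - 1) distribution
sigma0_sq <- 75
chi_stat  <- (n - 1) * var(X) / sigma0_sq
p_lower   <- pchisq(chi_stat, df = n - 1)
p_var     <- 2 * min(p_lower, 1 - p_lower)  # two-sided p-value (doubling convention)

mean_test$p.value
c(statistic = chi_stat, df = n - 1, p.value = p_var)
```

Each hypothesis is rejected at the 0.05 level only if the corresponding p-value is below 0.05.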

• Analysis of the steelmaking furnace data

```r
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)
var.test(X, Y)
#>
#>     F test to compare two variances
#>
#> data:  X and Y
#> F = 1.4945, num df = 9, denom df = 9, p-value = 0.559
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#>  0.3712079 6.0167710
#> sample estimates:
#> ratio of variances
#>           1.494481
```

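## Tests for a binomial proportion

• A batch of vegetable seeds has an average germination rate of P = 0.85. A random sample of 500 seeds is soaked in a seed-coating agent, and 445 of them germinate. Does the seed-coating agent have an effect?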

```r
binom.test(445, 500, p = 0.85)
#>
#>     Exact binomial test
#>
#> data:  445 and 500
#> number of successes = 445, number of trials = 500, p-value = 0.01207
#> alternative hypothesis: true probability of success is not equal to 0.85
#> 95 percent confidence interval:
#>  0.8592342 0.9160509
#> sample estimates:
#> probability of success
#>                   0.89
```

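• Past experience puts the rate of chromosomal abnormalities in newborns at about 1%. A hospital observed 400 local newborns and found one case of chromosomal abnormality. Is the abnormality rate in this region lower than the general level?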

```r
binom.test(1, 400, p = 0.01, alternative = "less")
#>
#>     Exact binomial test
#>
#> data:  1 and 400
#> number of successes = 1, number of trials = 400, p-value = 0.09048
#> alternative hypothesis: true probability of success is less than 0.01
#> 95 percent confidence interval:
#>  0.0000000 0.0118043
#> sample estimates:
#> probability of success
#>                 0.0025
```
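For the one-sided alternative `"less"`, the exact p-value is simply the binomial probability of observing at most one abnormal case among 400 when the true rate is 1%. A short sketch that reproduces it with `pbinom()` (the names `p_manual` and `p_test` are illustrative):

```r
# P(X <= 1) for X ~ Binomial(400, 0.01)
p_manual <- pbinom(1, size = 400, prob = 0.01)

# p-value reported by the exact binomial test
p_test <- binom.test(1, 400, p = 0.01, alternative = "less")$p.value

c(manual = p_manual, binom.test = p_test)
```

Since the p-value is above 0.05, the data do not show that the local rate is below 1% at the 0.05 level.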

## Nonparametric tests

### Pearson's chi-squared goodness-of-fit test (`chisq.test`)

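• The numbers of people who prefer each of five beer brands are as follows:

| Brand | Number of people |
|-------|------------------|
| A     | 210              |
| B     | 312              |
| C     | 170              |
| D     | 85               |
| E     | 223              |

Question: do the numbers of people preferring the different brands differ?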

```r
X <- c(210, 312, 170, 85, 223)
chisq.test(X)
#>
#>     Chi-squared test for given probabilities
#>
#> data:  X
#> X-squared = 136.49, df = 4, p-value < 2.2e-16
```
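With no `p` argument, `chisq.test()` tests against equal probabilities for all categories. The statistic is the sum of (observed - expected)^2 / expected, which the following sketch computes by hand for the beer data; the names `expected`, `chi_stat` and `p_value` are illustrative.

```r
X <- c(A = 210, B = 312, C = 170, D = 85, E = 223)

# Under H0 every brand is equally popular, so each expected count is the mean
expected <- rep(sum(X) / length(X), length(X))

# Pearson's chi-squared statistic and its upper-tail p-value
chi_stat <- sum((X - expected)^2 / expected)
p_value  <- pchisq(chi_stat, df = length(X) - 1, lower.tail = FALSE)

c(statistic = chi_stat, df = length(X) - 1)
```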

• Test whether student scores follow a normal distribution

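```r
X <- c(25, 45, 50, 54, 55, 61, 64, 68, 72, 75, 75,
       78, 79, 81, 83, 84, 84, 84, 85, 86, 86, 86,
       87, 89, 89, 89, 90, 91, 91, 92, 100)

# cut() splits the range of the variable into intervals;
# table() counts the observations falling into each interval
A <- table(cut(X, breaks = c(0, 69, 79, 89, 100)))

# Cell probabilities under a normal distribution with the sample mean and sd
p <- pnorm(c(70, 80, 90, 100), mean(X), sd(X))
p <- c(p[1], p[2] - p[1], p[3] - p[2], 1 - p[3])
chisq.test(A, p = p)
#>
#>     Chi-squared test for given probabilities
#>
#> data:  A
#> X-squared = 8.334, df = 3, p-value = 0.03959

# p < 0.05: the departure from normality is significant at the 0.05 level
```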

```r
chisq.test(c(335, 125, 160), p = c(9, 3, 4) / 16)
#>
#>     Chi-squared test for given probabilities
#>
#> data:  c(335, 125, 160)
#> X-squared = 1.362, df = 2, p-value = 0.5061
```

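• There are 42 observations, each giving the number of calls received by a telephone switchboard during a given time period:

| Number of calls received | 0 | 1  | 2  | 3 | 4 | 5 | 6 |
|--------------------------|---|----|----|---|---|---|---|
| Frequency                | 7 | 10 | 12 | 8 | 3 | 2 | 0 |

Question: does the number of calls received in a time period follow a Poisson distribution?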

```r
x <- 0:6
y <- c(7, 10, 12, 8, 3, 2, 0)
lambda <- mean(rep(x, y))   # estimate of the Poisson mean
q <- ppois(x, lambda)       # cumulative probabilities P(X <= k)
n <- length(y)

# Cell probabilities: P(X = 0), P(X = k) for k = 1..5, and P(X >= 6)
p <- numeric(n)
p[1] <- q[1]
p[n] <- 1 - q[n - 1]
for (i in 2:(n - 1)) p[i] <- q[i] - q[i - 1]
chisq.test(y, p = p)   # warns: several expected counts are below 5

# Merge the sparse cells (3 or more calls) into one and test again
Z <- c(7, 10, 12, 13)
n <- length(Z)
p <- p[1:(n - 1)]
p[n] <- 1 - q[n - 1]
chisq.test(Z, p = p)
```

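The smaller the p-value, the stronger the grounds for rejecting the null hypothesis, and the stronger the statistical evidence that the populations differ. Note that failing to reject H0 is not the same as supporting H0: it only means that the available sample does not carry enough information to reject H0.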
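The built-in `sleep` data used earlier give a concrete illustration. The two groups are measurements on the same ten patients, so a paired analysis is also possible. The sketch below compares the Welch test shown earlier with a paired test on the same numbers; the object names `welch` and `paired` are illustrative.

```r
# Welch two-sample test, ignoring the pairing (as in the earlier example)
welch <- with(sleep, t.test(extra[group == 1], extra[group == 2]))

# Paired test, using the fact that both groups are the same ten patients
paired <- with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))

c(welch = welch$p.value, paired = paired$p.value)
```

The Welch test does not reject H0 at the 0.05 level, while the paired test does. The first result therefore cannot be read as evidence that the two drugs have the same effect; it only reflects what that particular test could detect.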
