Visualizing Residuals

Source: Internet | Date: 2016-08-25

Introduction

Residuals are a key element for diagnosing problems in a statistical model. For example, linear regression assumes the residuals are homoscedastic; if they are not, the model may be misspecified, possibly because the data are nonlinear. This article demonstrates several ways to visualize the residuals of regression models.
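As a quick refresher on the concept (this snippet is my own addition, not from the original post): a residual is simply the observed value minus the model's fitted value, which we can verify directly in R:

```r
# Verify that a residual is the observed value minus the fitted value
fit <- lm(mpg ~ hp, data = mtcars)   # a simple linear model on built-in data
manual <- mtcars$mpg - predict(fit)  # actual minus fitted, computed by hand
stopifnot(isTRUE(all.equal(manual, residuals(fit))))
```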

Prerequisites

  • Familiarity with the concept of a residual and how it is computed;
  • Ability to run regressions (linear and logistic) in R, visualize them with ggplot2, and manipulate data with the dplyr and tidyr packages.

What R can already show

In R, you can pass a fitted regression model directly to plot() to get the standard diagnostic plots. Below, we use the mtcars data set to regress miles per gallon (mpg) on horsepower (hp) and visualize the model and residual information.

Code:

fit <- lm(mpg ~ hp, data = mtcars)  # Fit the model
summary(fit)                        # Report the results
#> 
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.7121 -2.1122 -0.8854  1.5819  8.2360 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
#> hp          -0.06823    0.01012  -6.742 1.79e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.863 on 30 degrees of freedom
#> Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
#> F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(fit)             # Plot the model diagnostics
par(mfrow = c(1, 1))  # Return plotting panel to 1 section

The plots above are the traditional way to inspect the residuals and judge whether the model has problems. Below, we build a complementary set of plots that convey the same information in a more visually appealing way.

Steps

  1. Fit a regression model to predict a variable Y;
  2. Obtain the predicted and residual values associated with each observation of Y;
  3. Plot the actual and predicted values of Y so that they are distinguishable but clearly connected;
  4. Use the residuals to adjust aesthetics (for example, coloring points with very large residuals red) to highlight the points the model predicts poorly.

Simple linear regression

Step 1: Fit the model

First, we need to fit the model. In this example, we copy the mtcars data set into a new object d:

d <- mtcars
fit <- lm(mpg ~ hp, data = d)
summary(fit)
## Call:
## lm(formula = mpg ~ hp, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Step 2: Obtain predicted and residual values

d$predicted <- predict(fit)    # Save the predicted values
d$residuals <- residuals(fit)  # Save the residual values

# Quickly inspect the actual, predicted, and residual values with dplyr
library(dplyr)
d %>% select(mpg, predicted, residuals) %>% head()
#>                    mpg predicted  residuals
#> Mazda RX4         21.0  22.59375 -1.5937500
#> Mazda RX4 Wag     21.0  22.59375 -1.5937500
#> Datsun 710        22.8  23.75363 -0.9536307
#> Hornet 4 Drive    21.4  22.59375 -1.1937500
#> Hornet Sportabout 18.7  18.15891  0.5410881
#> Valiant           18.1  22.93489 -4.8348913

Step 3: Plot the actual and predicted values

Plot the actual values as a scatterplot:

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +  # Set up the canvas with the outcome variable on the y-axis
  geom_point()                     # Plot the actual values

Then add the predicted values as a second layer:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_point(aes(y = predicted), shape = 1)  # Add the predicted values

In the plot above it is hard to tell which actual value corresponds to which prediction. geom_segment() draws a line connecting each actual point to its predicted value:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted)) +
  geom_point() +
  geom_point(aes(y = predicted), shape = 1)

Finally, make a few adjustments with ggplot2:

  • Use theme_bw() for a cleaner overall look;
  • Fade the connecting lines by lowering their alpha transparency;
  • Add the regression slope with geom_smooth().

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  # Plot regression slope
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +   # alpha to fade lines
  geom_point() +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()  # Add theme for cleaner look

Step 4: Use the residuals to adjust aesthetics

Finally, we want to highlight each point according to the size of its residual. There are many ways to do this; the code below demonstrates several options:

# ALPHA
# Map the transparency of the actual values to the absolute residual
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  # > Alpha adjustments made here...
  geom_point(aes(alpha = abs(residuals))) +  # Alpha mapped to abs(residuals)
  guides(alpha = FALSE) +                    # Alpha legend removed
  # <
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

Notice that the smaller the residual, the more transparent the point.

# COLOR
# High residuals (in absolute terms) made more red on actual values.
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  # > Color adjustments made here...
  geom_point(aes(color = abs(residuals))) +              # Color mapped to abs(residuals)
  scale_color_continuous(low = "black", high = "red") +  # Colors to use here
  guides(color = FALSE) +                                # Color legend removed
  # <
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

# SIZE AND COLOR
# Same coloring as above, with size corresponding as well
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  # > Color AND size adjustments made here...
  geom_point(aes(color = abs(residuals), size = abs(residuals))) +  # size also mapped
  scale_color_continuous(low = "black", high = "red") +
  guides(color = FALSE, size = FALSE) +  # Size legend also removed
  # <
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

# COLOR UNDER/OVER
# Color mapped to residual with sign taken into account,
# i.e., whether the actual value is greater or less than predicted
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  # > Color adjustments made here...
  geom_point(aes(color = residuals)) +  # Color mapped here
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +  # Colors to use here
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

This last example is particularly useful because color helps reveal nonlinearity in the data. Here, the extreme values of hp show more red (actual values greater than predicted), while the middle range of hp shows more blue (actual values less than predicted). Together, this suggests the relationship between the two variables is nonlinear and might be better modeled by adding a quadratic term to the regression equation.
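The quadratic suggestion can be checked numerically. As a sketch (my own addition, not code from the original post), fit a model with an added I(hp^2) term and compare the fits:

```r
# Compare the linear fit with a quadratic fit (sketch, not from the original post)
fit1 <- lm(mpg ~ hp, data = mtcars)            # linear in hp
fit2 <- lm(mpg ~ hp + I(hp^2), data = mtcars)  # add a quadratic term
# The quadratic model explains noticeably more variance
summary(fit1)$adj.r.squared
summary(fit2)$adj.r.squared
```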

Multiple regression

In this example, we regress miles per gallon (mpg) on horsepower (hp), weight (wt), and displacement (disp):

# Select the data of interest
library(dplyr)
d <- mtcars %>% select(mpg, hp, wt, disp)

# Fit the multiple regression model
fit <- lm(mpg ~ hp + wt + disp, data = d)

# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)
head(d)
#>                    mpg  hp    wt disp predicted  residuals
#> Mazda RX4         21.0 110 2.620  160  23.57003 -2.5700299
#> Mazda RX4 Wag     21.0 110 2.875  160  22.60080 -1.6008028
#> Datsun 710        22.8  93 2.320  108  25.28868 -2.4886829
#> Hornet 4 Drive    21.4 110 3.215  258  21.21667  0.1833269
#> Hornet Sportabout 18.7 175 3.440  360  18.24072  0.4592780
#> Valiant           18.1 105 3.460  225  20.47216 -2.3721590

Plot the actual and predicted values, connect them with segments, and clean up the background:

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +  # Lines to connect points
  geom_point() +                               # Points of actual values
  geom_point(aes(y = predicted), shape = 1) +  # Points of predicted values
  theme_bw()

Again, we can use the residuals to refine the plot and aid interpretation:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

So far the code has hardly changed, but because this is a multiple regression, the predicted values no longer line up neatly along a single line.

Plotting against multiple predictors at once

Plotting against a single predictor is fine, but the point of a multiple regression is to study several predictors. The trick for plotting against all of them at once is to use gather() from the tidyr package to reshape the data from wide to long, collapsing the predictor columns into a single column, and then split the panels with ggplot2's facet_*() functions.
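To make the wide-to-long reshape concrete, here is a tiny standalone sketch (the toy data frame is my own, only for illustration):

```r
library(tidyr)

# A wide data frame: two predictor columns, a and b
wide <- data.frame(id = 1:2, a = c(10, 20), b = c(30, 40))

# gather() collapses a and b into a key column ("iv") and a value column ("x"),
# producing one row per (id, predictor) pair: 2 ids x 2 predictors = 4 rows
long <- gather(wide, key = "iv", value = "x", -id)
long
```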

For more background on these techniques, see:

Plotting background data for groups with ggplot2

Plot some variables against many others with tidyr and ggplot2

Quick plot of all variables

OK, let's put it all together:

library(tidyr)
d %>%
  gather(key = "iv", value = "x", -mpg, -predicted, -residuals) %>%  # Reshape hp, wt, and disp into one column
  ggplot(aes(x = x, y = mpg)) +  # Note use of `x` here and next line
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +  # Split panels here by `iv`
  theme_bw()

Let's try this on another data set. For example, in the iris data set, regress Sepal.Width on all the other continuous variables:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

d <- iris %>% select(-Species)

# Fit the model
fit <- lm(Sepal.Width ~ ., data = d)

# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)

# Create plot
d %>%
  gather(key = "iv", value = "x", -Sepal.Width, -predicted, -residuals) %>%
  ggplot(aes(x = x, y = Sepal.Width)) +
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +
  theme_bw()

The panels let us compare the actual and predicted values against each predictor. Note that the colored points are the actual values, while the hollow grey-outlined points are the predictions. As expected, with multiple predictors, the predictions deviate less from the actual values than in the earlier single-predictor regressions.
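That claim can be checked numerically with the mtcars models used earlier: the residual standard error of the multiple regression is smaller than that of the single-predictor model (a quick sketch of my own):

```r
# Residual standard error shrinks as predictors are added (mtcars example)
fit_simple   <- lm(mpg ~ hp, data = mtcars)
fit_multiple <- lm(mpg ~ hp + wt + disp, data = mtcars)
sigma(fit_simple)    # about 3.863, as in the summary() output above
sigma(fit_multiple)  # smaller: the predictions sit closer to the actual values
```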

Logistic regression

The same workflow extends to logistic regression; the basic steps are unchanged, but we need to extract the predicted and residual values on the appropriate scale. The example below uses hp to predict the engine configuration vs, a binary variable coded 0/1:

# Step 1: Fit the model
d <- mtcars
fit <- glm(vs ~ hp, family = binomial(), data = d)

# Step 2: Obtain predicted and residual values (on the response scale)
d$predicted <- predict(fit, type = "response")
d$residuals <- residuals(fit, type = "response")

head(d)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  predicted
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.69769525
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.69769525
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.88099419
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.69769525
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.02608156
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.76479499
##                     residuals
## Mazda RX4         -0.69769525
## Mazda RX4 Wag     -0.69769525
## Datsun 710         0.11900581
## Hornet 4 Drive     0.30230475
## Hornet Sportabout -0.02608156
## Valiant            0.23520501

# Steps 3 and 4: Plot the results
ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

If we just want to flag the misclassified cases (where rounding the predicted probability disagrees with vs), we can use filter() from dplyr:

ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point() +
  # > This plots a large red circle on misclassified points
  geom_point(data = d %>% filter(vs != round(predicted)),
             color = "red", size = 2) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

Extension: the broom package

The broom package streamlines the workflow above by merging Steps 1 and 2 into a single call. broom converts statistical objects into tidy R data frames; here, the augment() function turns a fitted regression model into a data frame containing the predicted and residual values. Two things to note:

  • The predicted and residual columns are named .fitted and .resid;
  • We also get a number of extra columns, which we may want to drop before reshaping with gather() and faceting with facet_*().

# install.packages('broom')
library(broom)
library(dplyr)

# Steps 1 and 2
d <- lm(mpg ~ hp, data = mtcars) %>%
  augment()
head(d)
#>           .rownames  mpg  hp  .fitted   .se.fit     .resid       .hat
#> 1         Mazda RX4 21.0 110 22.59375 0.7772744 -1.5937500 0.04048627
#> 2     Mazda RX4 Wag 21.0 110 22.59375 0.7772744 -1.5937500 0.04048627
#> 3        Datsun 710 22.8  93 23.75363 0.8726286 -0.9536307 0.05102911
#> 4    Hornet 4 Drive 21.4 110 22.59375 0.7772744 -1.1937500 0.04048627
#> 5 Hornet Sportabout 18.7 175 18.15891 0.7405479  0.5410881 0.03675069
#> 6           Valiant 18.1 105 22.93489 0.8026728 -4.8348913 0.04317538
#>     .sigma      .cooksd .std.resid
#> 1 3.917367 0.0037426122 -0.4211862
#> 2 3.917367 0.0037426122 -0.4211862
#> 3 3.924793 0.0017266396 -0.2534156
#> 4 3.922478 0.0020997190 -0.3154767
#> 5 3.927667 0.0003885555  0.1427178
#> 6 3.820288 0.0369380489 -1.2795289

# Steps 3 and 4
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = .fitted), alpha = .2) +  # Note `.fitted`
  geom_point(aes(alpha = abs(.resid))) +                      # Note `.resid`
  guides(alpha = FALSE) +
  geom_point(aes(y = .fitted), shape = 1) +                   # Note `.fitted`
  theme_bw()
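As the second bullet above notes, the extra augment() columns can be dropped before reshaping. One reasonable way (the column choice here is my own, not prescribed by the original):

```r
library(broom)
library(dplyr)

d <- lm(mpg ~ hp, data = mtcars) %>% augment()

# Keep only the columns the residual plots need
d_small <- d %>% select(mpg, hp, .fitted, .resid)
head(d_small)
```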

This article: http://www.cnblogs.com/homewch/p/5806405.html

Original post: https://drsimonj.svbtle.com/visualising-residuals
