问题描述:

Consider the following R code (which, I think, eventually calls some Fortran):

`X <- 1:1000`

Y <- rep(1,1000)

summary(lm(Y~X))

Why are values returned by summary? Shouldn't this model fail to fit since there is no variance in Y? More importantly, why is the model R^2 ~= .5?

**Edit**

I tracked the code from lm to lm.fit and can see this call:

`z <- .Fortran("dqrls", qr = x, n = n, p = p, y = y, ny = ny,`

tol = as.double(tol), coefficients = mat.or.vec(p, ny), residuals = y,

effects = y, rank = integer(1L), pivot = 1L:p, qraux = double(p),

work = double(2 * p), PACKAGE = "base")

That is where the actual fit seems to happen. Looking at http://svn.r-project.org/R/trunk/src/appl/dqrls.f) did not help me understand what is going on, because I do not know fortran.

Statistically speaking, what should we anticipate (I'd like to say "expect", but that's a very specific term ;-))? The coefficients should be (0,1), rather than "fail to fit". The covariance of (X,Y) is assumed proportional to the variance of X, not the other way around. As X has non-zero variance, there is no problem. As the covariance is 0, the estimated coefficient for X should be 0. So, within machine tolerance, this is the answer you're getting.

There is no statistical anomaly here. There may be a statistical misunderstanding. There's also the issue of machine tolerance, but a coefficient on the order of 1E-19 is rather negligible, given the scale of the predictor and response values.

Update 1: A quick review of simple linear regression can be found on this Wikipedia page. The key thing to note is that `Var(x)`

is in the denominator, `Cov(x,y)`

in the numerator. In this case, the numerator is 0, the denominator is non-zero, so there is no reason to expect a `NaN`

or `NA`

. However, one may ask why isn't the resulting coefficient for `x`

a `0`

, and that has to do with numerical precision issues of the QR decomposition.

I believe this is simply because the QR decomposition is implemented with floating point arithmetic.

The `singular.ok`

parameter actually refers to the design matrix (i.e. X only). Try

```
lm.fit(cbind(X, X), Y)
```

vs.

```
lm.fit(cbind(X, X), Y, singular.ok=F)
```

I agree that the problem might be of floating point. but I don't think is singularity.

If you check using `solve(t(x1)%*%x1)%*%(t(x1)%*%Y)`

instead of QR, `(t(x1)%*%x1)`

is not singular

use `x1 = cbind(rep(1,1000,X)`

because `lm(Y~X)`

includes the intercept.