Consider the following R code (which, I think, eventually calls some Fortran):

```
X <- 1:1000
Y <- rep(1, 1000)
summary(lm(Y ~ X))
```

Why are values returned by summary? Shouldn't this model fail to fit since there is no variance in Y? More importantly, why is the model R^2 ~= .5?

Edit

I tracked the code from lm to lm.fit and can see this call:

```
z <- .Fortran("dqrls", qr = x, n = n, p = p, y = y, ny = ny,
    tol = as.double(tol), coefficients = mat.or.vec(p, ny), residuals = y,
    effects = y, rank = integer(1L), pivot = 1L:p, qraux = double(p),
    work = double(2 * p), PACKAGE = "base")
```

That is where the actual fit seems to happen. Looking at http://svn.r-project.org/R/trunk/src/appl/dqrls.f did not help me understand what is going on, because I do not know Fortran.

Statistically speaking, what should we anticipate (I'd like to say "expect", but that's a very specific term ;-))? The coefficients should be an intercept of 1 and a slope of 0, rather than a failure to fit. The slope estimate is the covariance of (X,Y) divided by the variance of X, not the other way around. As X has non-zero variance, there is no problem with the denominator. As the covariance is 0, the estimated slope for X should be 0. So, within machine tolerance, this is the answer you're getting.

There is no statistical anomaly here. There may be a statistical misunderstanding. There's also the issue of machine tolerance, but a coefficient on the order of 1E-19 is rather negligible, given the scale of the predictor and response values.

Update 1: A quick review of simple linear regression can be found on this Wikipedia page. The key thing to note is that `Var(x)` is in the denominator, `Cov(x,y)` in the numerator. In this case, the numerator is 0, the denominator is non-zero, so there is no reason to expect a `NaN` or `NA`. However, one may ask why isn't the resulting coefficient for `x` a `0`, and that has to do with numerical precision issues of the QR decomposition.
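The formula can be checked directly in R. Since `Y` is constant, every term `(y_i - mean(y))` in the covariance sum is exactly 0 in floating point, so the numerator is exactly 0 while the denominator is not:

```r
# Closed-form slope of simple linear regression: beta1 = Cov(x, y) / Var(x)
X <- 1:1000
Y <- rep(1, 1000)

cov(X, Y)            # exactly 0: each (y_i - mean(y)) term is exactly 0
cov(X, Y) / var(X)   # so the closed-form slope is exactly 0, not NaN or NA
```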

I believe this is simply because the QR decomposition is implemented with floating point arithmetic.

The `singular.ok` parameter actually refers to the design matrix (i.e. X only). Try

```
lm.fit(cbind(X, X), Y)
```

vs.

```
lm.fit(cbind(X, X), Y, singular.ok = FALSE)
```

I agree that the problem is likely floating point, but I don't think it is singularity.

If you check using `solve(t(x1)%*%x1)%*%(t(x1)%*%Y)` instead of QR, you'll see that `t(x1)%*%x1` is not singular.

Use `x1 <- cbind(rep(1,1000), X)`, because `lm(Y~X)` includes the intercept.
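Putting those two comments together, a sketch of the normal-equations fit:

```r
X <- 1:1000
Y <- rep(1, 1000)
x1 <- cbind(rep(1, 1000), X)        # design matrix with an intercept column
XtX <- t(x1) %*% x1                 # not singular, so solve() succeeds
beta <- solve(XtX) %*% (t(x1) %*% Y)
beta                                # approximately (1, 0): intercept 1, slope 0
```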
