Problem description:

## Update:

How does one express a linear model where observations can belong to multiple categories and the number of categories is large?

For example, using time dummies as the categories, here is a problem that is easy to set up since the number of categories (time periods) is small and known:

```
tmp <- "day 1, day 2
0,1
1,0
1,1"
periods <- read.csv(text = tmp)
y <- rnorm(3)
print(lm(y ~ day.1 + day.2 + 0, data=periods))
```

Now suppose that instead of two days there were 100. Would I need to create a formula like the following?

`y ~ day.1 + day.2 + ... + day.100 + 0`

Presumably such a formula would have to be created programmatically. This seems inelegant and un-R-like.
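For reference, programmatic construction need not be long. The following is a minimal sketch using base R's `reformulate()`, assuming the dummy columns are named `day.1`, ..., `day.100`; the simulated data frame is made up purely for illustration.

```r
# Hypothetical data: 100 dummy columns named day.1, ..., day.100 plus a response y
set.seed(1)
n <- 1000
periods <- as.data.frame(matrix(rbinom(n * 100, 1, 0.5), nrow = n))
names(periods) <- paste0("day.", 1:100)
periods$y <- rnorm(n)

# Build y ~ day.1 + ... + day.100 without an intercept
f <- reformulate(paste0("day.", 1:100), response = "y", intercept = FALSE)
fit <- lm(f, data = periods)
length(coef(fit))
```

The term labels are ordinary strings, so the same call works for any number of columns without typing the formula by hand.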

What is the right R way to tackle this? For example, aside from the formula problem, is there a better way to create the dummies than creating a matrix of 1s and 0s (as I did above)? For the sake of concreteness, say that the actual data consists (for each observation) of a start and end date (so that `tmp` would contain a 1 in each column between start and end).
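To make the start/end setup concrete, here is one possible sketch that derives the 0/1 matrix from start and end values with `outer()`. The vectors `start`, `end`, and `days` are hypothetical and use integer day indices; they are chosen so the result matches the toy matrix above.

```r
# Hypothetical start/end days for the three observations in the toy example
start <- c(2, 1, 1)
end   <- c(2, 1, 2)
days  <- 1:2

# Entry [i, j] is 1 when day j lies within [start[i], end[i]], else 0
X <- 1 * (outer(start, days, "<=") & outer(end, days, ">="))
colnames(X) <- paste0("day.", days)
X
```

Real dates would first need to be mapped to such indices (for example with `as.numeric()` on `Date` values), which is left out here.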

Based on the answer of @jlhoward, here is a larger example:

```
num.observations <- 1000
# Manually create 100 columns of dummies called X1, ..., X100
periods <- data.frame(1*matrix(runif(num.observations*100) > 0.5, nrow = num.observations))
y <- rnorm(num.observations)
print(summary(lm(y ~ ., data = periods)))
```

It illustrates the manual creation of a data frame of dummies (1s and 0s). I would be interested in learning whether there is a more R-like way of dealing with this "multiple dummies per observation" issue.

You can use the `.` notation to include all variables other than the response in a formula, and `-1` to remove the intercept. Also, put everything in your data frame; don't make `y` a separate vector.

```
set.seed(1) # for reproducibility
df <- data.frame(y=rnorm(3),read.csv(text=tmp))
fit.1 <- lm(y ~ day.1 + day.2 + 0, df)
fit.2 <- lm(y ~ -1 + ., df)
identical(coef(fit.1),coef(fit.2))
# [1] TRUE
```
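For the larger case with 100 indicator columns, another option is to keep the dummies as a single matrix and use that matrix as one term in the formula; `lm()` then estimates one coefficient per column. The sketch below is only an illustration on simulated data, and the names `X`, `df.m`, and `df.d` are made up for it.

```r
set.seed(1)
n <- 1000
# Simulated 0/1 indicator matrix, mirroring the larger example in the question
X <- 1 * matrix(runif(n * 100) > 0.5, nrow = n)
colnames(X) <- paste0("day.", 1:100)
y <- rnorm(n)

# Store the whole matrix as a single column of the data frame
df.m <- data.frame(y = y)
df.m$X <- X
fit.m <- lm(y ~ X + 0, data = df.m)

# Equivalent fit with one data frame column per dummy and the dot notation
df.d <- data.frame(y = y, X)
fit.d <- lm(y ~ . - 1, data = df.d)
all.equal(unname(coef(fit.m)), unname(coef(fit.d)))
```

With the matrix term the coefficient names are prefixed by the matrix name (`Xday.1`, `Xday.2`, ...), which is why the comparison drops the names.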