I have a dataframe as follows. It is ordered by column `time`.

Input -

``````
df = data.frame(time = 1:20, grp = sort(rep(1:5, 4)), var1 = rep(c('A', 'B'), 10))
head(df, 10)
   time grp var1
1     1   1    A
2     2   1    B
3     3   1    A
4     4   1    B
5     5   2    A
6     6   2    B
7     7   2    A
8     8   2    B
9     9   3    A
10   10   3    B
``````

I want to create another variable `var2` that counts the number of distinct `var1` values seen so far, i.e. up to that point in `time`, within each group `grp`. This is a little different from what I'd get with `n_distinct`, which returns a single count for the whole group.

Expected output -

``````
   time grp var1 var2
1     1   1    A    1
2     2   1    B    2
3     3   1    A    2
4     4   1    B    2
5     5   2    A    1
6     6   2    B    2
7     7   2    A    2
8     8   2    B    2
9     9   3    A    1
10   10   3    B    2
``````

I want to create a function say `cum_n_distinct` for this and use it as -

``````
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
``````

Assuming stuff is ordered by `time` already, first define a cumulative distinct function:

``````dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
``````
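To make the behaviour concrete, here is a small check of `dist_cum` on a made-up vector (the vector `x` is purely illustrative and not part of the question's data). For each position it takes the prefix up to that position and counts its unique values:

```r
# Same definition as above, repeated so the snippet runs on its own
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

x <- c("A", "B", "A", "C", "B")   # illustrative input
dist_cum(x)
# [1] 1 2 2 3 3
```

Because every position re-scans its whole prefix, the cost grows roughly quadratically with the group size, which should be fine for small groups like the ones here.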

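Then a base solution that uses `ave` to apply our function within each group. `ave` expects a numeric vector, so `var1` is first converted to integer codes via `factor` (since R 4.0.0, `data.frame` no longer turns strings into factors by default, so a bare `as.integer(var1)` would give `NA`s on character data):

``````transform(df, var2=ave(as.integer(factor(var1)), grp, FUN=dist_cum))
``````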

A `data.table` solution, basically doing the same thing:

``````library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
``````

And `dplyr`, again, same thing:

``````library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
``````

### A `dplyr` solution inspired by @akrun's answer -

The logic is to flag the first occurrence of each distinct `var1` value within each group `grp` as `1` and every later occurrence as `0`, and then apply `cumsum` to those flags within `grp` -

``````df = df %>%
  arrange(time) %>%
  group_by(grp, var1) %>%
  mutate(var_temp = ifelse(row_number() == 1, 1, 0)) %>%  # 1 on first occurrence, else 0
  group_by(grp) %>%
  mutate(var2 = cumsum(var_temp)) %>%                     # running count of first occurrences
  select(-var_temp)

head(df, 10)

# Source: local data frame [10 x 4]
# Groups: grp
#
#    time grp var1 var2
# 1     1   1    A    1
# 2     2   1    B    2
# 3     3   1    A    2
# 4     4   1    B    2
# 5     5   2    A    1
# 6     6   2    B    2
# 7     7   2    A    2
# 8     8   2    B    2
# 9     9   3    A    1
# 10   10   3    B    2
``````
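The same first-occurrence-then-`cumsum` idea can probably be packed into the `cum_n_distinct` helper the question asks for. The sketch below is one possible way to do it, not the method of any answer here: it relies on base R's `duplicated`, which is `FALSE` exactly at the first occurrence of each value, and it assumes `dplyr` is installed and the rows are already in the desired order within each group.

```r
library(dplyr)

# Running count of distinct values: add 1 each time a value appears for the first time
cum_n_distinct <- function(x) cumsum(!duplicated(x))

# Data from the question
df <- data.frame(time = 1:20,
                 grp  = sort(rep(1:5, 4)),
                 var1 = rep(c("A", "B"), 10))

d_out <- df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1)) %>%
  ungroup()

d_out$var2[1:10]
# [1] 1 2 2 2 1 2 2 2 1 2
```

This makes a single pass over each group and needs no temporary column.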

Try:

### Update

With your new dataset, an approach in base R

``````df$var2 <- unlist(lapply(split(df, df$grp),
  function(x) {
    x$var2 <- 0
    indx <- match(unique(x$var1), x$var1)
    x$var2[indx] <- 1
    cumsum(x$var2)
  }))

head(df,7)
#   time grp var1 var2
# 1    1   1    A    1
# 2    2   1    B    2
# 3    3   1    A    2
# 4    4   1    B    2
# 5    5   2    A    1
# 6    6   2    B    2
# 7    7   2    A    2
``````
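One thing to watch with `split` followed by `unlist` is that the results come back grouped by `grp`, so assigning them to a column only lines up when the rows are already sorted by `grp`, as they are in the question. If that is not guaranteed, a variant with `ave` may be safer, since `ave` writes each group's result back to the original row positions. The data frame below is a made-up example with interleaved groups, used only to illustrate this:

```r
# Illustrative data: groups 1 and 2 alternate row by row
df <- data.frame(time = 1:8,
                 grp  = rep(1:2, 4),
                 var1 = c("A", "A", "B", "A", "A", "B", "B", "B"))

# Pass row indices through ave(); within each group, count first occurrences so far
df$var2 <- ave(seq_len(nrow(df)), df$grp,
               FUN = function(i) cumsum(!duplicated(df$var1[i])))

df$var2
# [1] 1 1 2 1 2 2 2 2
```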
