I would like to calculate the mean of a variable on a data.frame in R from the row which another variable start to have a constant value. I usually use dplyr for this database kind of task but I dont figure out how to do this, here is an example:

``s<-"no Spc PSize2 0 64932 0 92812 12 261832 12 361802 12 378062 12 377653 12 360153 12 266613 0 140313 0 55643 1 177013 1 208083 1 315113 1 447463 1 505343 1 548583 1 581603 1 60326"d<-read.delim(textConnection(s),sep="",header=T)mean(d[1:10,3])sd(d[1:10,3])``

From the row 11 the variable spc have a constant value, so this is the place I want to split the data.frame

``mean(d[11:18,3])sd(d[11:18,3])``

I can calculate it by hand, but that is not the idea...

You can do it by adding a column that checks whether or not the entry matches the value above, then use `cumsum` to find the places where the count changes. I `group_by`'d that, and calculated the summaries you wanted -- I also added an output of which rows were included to demonstrate where it was grabbing from.

``````d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc)
, whichGroup = cumsum(isDiff)) %>%
group_by(whichGroup, Spc) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
``````

Gives:

``````  whichGroup   Spc    mean        sd whichRows
<int> <int>   <dbl>     <dbl>     <chr>
1          0     0  7887.0  1971.414       1:2
2          1    12 33435.0  5486.794       3:8
3          2     0  9797.5  5987.073      9:10
4          3     1 42330.5 16866.591     11:18
``````

If you only want the last group, which I can't tell from your post if you do or not, you could instead use `filter`, like this:

``````d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc)
, whichGroup = cumsum(isDiff)) %>%
filter(whichGroup == max(whichGroup)) %>%
summarise(Spc = Spc
, mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
``````

Which gives:

``````  Spc    mean       sd whichRows
1   1 42330.5 16866.59     11:18
``````

Based on the comment, you appear to want the last group vs. the rest, you can get that with this approach:

``````d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc)
, whichGroup = cumsum(isDiff)) %>%
group_by(isLast = whichGroup == max(whichGroup)) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
``````

which gives:

``````  isLast    mean       sd whichRows
<lgl>   <dbl>    <dbl>     <chr>
1  FALSE 23597.9 13521.32      1:10
2   TRUE 42330.5 16866.59     11:18
``````

Option 1: Using `rleid` from the `data.table` package:

``````d %>%
group_by(rlid = rleid(Spc)) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize)) %>%
slice(n())
``````

gives:

``````# A tibble: 1 × 3
rlid mean_size  sd_size
<int>     <dbl>    <dbl>
1     4   42330.5 16866.59
``````

Option 2: Using `rle`:

``````startrow <- sum(head(rle(d\$Spc)\$lengths, -1)) + 1
d %>%
slice(startrow:n()) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize))
``````

gives:

``````  mean_size  sd_size
1   42330.5 16866.59
``````

Option 3: If you want to calculate for two groups (last and others) you should use `group_by` instead of `filter` and create a new grouping vector (`rep_vec`) with `rle`:

``````rep_vec <- c(sum(head(rle(d\$Spc)\$lengths, -1)), tail(rle(d\$Spc)\$lengths, 1))

d %>%
group_by(grp = rep(c('others','last_group'), rep_vec)) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize))
``````

which gives:

``````         grp mean_size  sd_size
(chr)     (dbl)    (dbl)
1 last_group   42330.5 16866.59
2     others   23597.9 13521.32
``````

If you want to include the rows, you can change the code to:

``````d %>%
mutate(rn = row_number()) %>%
group_by(grp = rep(c('others','last_group'), rep_vec)) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize), rows = paste0(range(rn), collapse=':'))
``````

which gives:

``````         grp mean_size  sd_size  rows
<chr>     <dbl>    <dbl> <chr>
1 last_group   42330.5 16866.59 11:18
2     others   23597.9 13521.32  1:10
``````

So you want to find the index where the middle vector starts to be constant? You can take the `diff()` of your vector and look for the first time this is different from zero. For example,

``````vec <- c(1,2,3,4,5,5,5,6,6,6)
diff(vec)
differences <- rev(diff(vec))

# distance from the end of first non-zero
min.dist <- min(which(differences != 0))

# take difference
length(vec) - min.dist + 1
``````

That last value should give you the index where the vector starts to be constant.

Top