Problem description:

I have two arrays as output from a simulation script, where one contains IDs and the other times, i.e. something like:

`ids = np.array([2, 0, 1, 0, 1, 1, 2])`

`times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])`

These arrays are always of the same size. Now I need to calculate the differences of `times`, but only for those times with the same `ids`. Of course, I can simply loop over the different `ids` and do

```
for id in np.unique(ids):
    diffs = np.diff(times[ids == id])
    print(diffs)
    # do stuff with diffs
```

However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?

You can use `array.argsort()` and ignore the values corresponding to a change in ids:

```
>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
>>> times_diffs
array([ 0.2, -0.2,  0.3,  0.6, -1.1,  1.2])
```

To see which values you need to discard, you could use a `Counter` to count the number of times per id (`from collections import Counter`), or just sort `ids` and see where its diff is nonzero: these are the indices where the id changes, and where your time diffs are irrelevant:

```
times_diffs[np.diff(ids[id_ind]) == 0]  # ids[id_ind] is the sorted ids sequence
```

and finally you can split this array with `np.split` and `np.where`:

```
np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])
```
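Putting those pieces together, here is a minimal runnable sketch of the whole approach. The final cleanup step is my addition: every group after the first still begins with the diff taken across an id boundary, so it gets dropped.

```python
import numpy as np

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

# Stable sort by id, then diff the reordered times.
id_ind = ids.argsort(kind='mergesort')
times_diffs = np.diff(times[id_ind])

# Split at the positions where the sorted ids change.
groups = np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])

# Each group after the first still starts with the diff computed
# across an id boundary; drop it to get the per-id differences.
per_id = [groups[0]] + [g[1:] for g in groups[1:]]
```

For the example arrays this yields `[0.2]` for id 0, `[0.3, 0.6]` for id 1 and `[1.2]` for id 2.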

As you mentioned in your comment, `argsort()`'s default algorithm (quicksort) might not preserve the order between equal times, so the `argsort(kind='mergesort')` option must be used.

Say you `np.argsort` by `ids`:

```
>>> inds = np.argsort(ids, kind='mergesort')
>>> inds
array([1, 3, 2, 4, 5, 0, 6])
```

Now sort `times` by this, apply `np.diff`, and prepend a `nan`:

```
>>> diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs
array([ nan,  0.2, -0.2,  0.3,  0.6, -1.1,  1.2])
```

These differences are correct except at the boundaries. Let's calculate those:

```
>>> boundaries = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))
>>> boundaries
array([False,  True, False,  True,  True, False,  True], dtype=bool)
```

Now we can just do

```
diffs[~boundaries] = np.nan
```

Let's see what we got:

```
>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])
>>> times[inds]
array([ 0.3, 0.5, 0.3, 0.6, 1.2, 0.1, 1.3])
>>> diffs
array([ nan, 0.2, nan, 0.3, 0.6, nan, 1.2])
```
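Collected into one self-contained snippet (same names and steps as the walkthrough above):

```python
import numpy as np

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

# Stable sort by id so equal times keep their original order.
inds = np.argsort(ids, kind='mergesort')

# Diff the id-sorted times; prepend nan so diffs aligns with times[inds].
diffs = np.concatenate(([np.nan], np.diff(times[inds])))

# True where an entry shares its id with the previous (sorted) entry.
boundaries = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))

# Invalidate the diffs computed across id boundaries.
diffs[~boundaries] = np.nan
```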

The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for these kinds of grouping operations:

```
import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)
```

Unlike pandas, it does not require a specialized datastructure just to perform this kind of rather elementary operation.

I'm adding another answer since, even though these things are possible in `numpy`, I think that the higher-level `pandas` is much more natural for them.

In `pandas`, you could do this in one step, after creating a DataFrame:

```
import pandas as pd

df = pd.DataFrame({'ids': ids, 'times': times})
df['diffs'] = df.groupby('ids')['times'].diff()
```

This gives:

```
>>> df
   ids  times  diffs
0    2    0.1    NaN
1    0    0.3    NaN
2    1    0.3    NaN
3    0    0.5    0.2
4    1    0.6    0.3
5    1    1.2    0.6
6    2    1.3    1.2
```