Question:

I have two arrays as an output from a simulation script where one contains IDs and one times, i.e. something like:

ids = np.array([2, 0, 1, 0, 1, 1, 2])

times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

These arrays are always of the same size. Now I need to calculate the differences of times, but only for those times with the same ids. Of course, I can simply loop over the different ids and do

for id in np.unique(ids):
    diffs = np.diff(times[ids == id])
    print(diffs)
    # do stuff with diffs

However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?

Answer:

You can use array.argsort() and ignore the values corresponding to change in ids:

>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
>>> times_diffs
array([ 0.2, -0.2,  0.3,  0.6, -1.1,  1.2])

To see which values you need to discard, you could use a Counter (from collections import Counter) to count the number of times per id,

or just sort ids and see where their diff is nonzero: these are the indices where the id changes, and where your time diffs are irrelevant:

times_diffs[np.diff(ids[id_ind]) == 0]  # ids[id_ind] is the sorted ids array

and finally you can split this array with np.split and np.where:

np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])

(note that every group after the first still starts with the irrelevant cross-id diff, which should be dropped)

As you mentioned in your comment, argsort()'s default algorithm (quicksort) is not stable and might not preserve the order of equal ids, so the argsort(kind='mergesort') option must be used.
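Putting the steps above together, here is a minimal end-to-end sketch of this approach (the `per_id` clean-up of the split groups, dropping each leading cross-id diff, is an addition, not part of the answer above):

```python
import numpy as np

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

# stable sort so equal ids keep their relative time order
id_ind = ids.argsort(kind='mergesort')
sorted_ids = ids[id_ind]
times_diffs = np.diff(times[id_ind])

# keep only the diffs computed within a single id
valid = times_diffs[np.diff(sorted_ids) == 0]

# or split into per-id groups; every group after the first still
# starts with the irrelevant cross-id diff, so drop it
groups = np.split(times_diffs, np.where(np.diff(sorted_ids) != 0)[0])
per_id = [groups[0]] + [g[1:] for g in groups[1:]]
```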

Answer:

Say you np.argsort by ids:

>>> inds = np.argsort(ids, kind='mergesort')
>>> inds
array([1, 3, 2, 4, 5, 0, 6])

Now sort times by this, np.diff, and prepend a nan:

>>> diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs
array([ nan,  0.2, -0.2,  0.3,  0.6, -1.1,  1.2])

These differences are correct except at the group boundaries. Let's mark the positions whose id is the same as the previous one:

>>> same_id = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))
>>> same_id
array([False,  True, False,  True,  True, False,  True], dtype=bool)

Now we can just do

diffs[~same_id] = np.nan

Let's see what we got:

>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])

>>> times[inds]
array([ 0.3,  0.5,  0.3,  0.6,  1.2,  0.1,  1.3])

>>> diffs
array([ nan,  0.2,  nan,  0.3,  0.6,  nan,  1.2])
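The whole approach can be collected into one runnable sketch (the function name `grouped_diffs` is illustrative, not from any library):

```python
import numpy as np

def grouped_diffs(ids, times):
    """Per-element time diffs within each id; NaN wherever a new id group starts."""
    inds = np.argsort(ids, kind='mergesort')  # stable: ties keep their time order
    diffs = np.concatenate(([np.nan], np.diff(times[inds])))
    same_as_prev = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))
    diffs[~same_as_prev] = np.nan  # first element of each group has no predecessor
    return inds, diffs

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
inds, diffs = grouped_diffs(ids, times)
```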
Answer:

The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for this kind of grouping operation:

import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)

Unlike pandas, it does not require a specialized data structure just to perform this kind of rather elementary operation.

Answer:

I'm adding another answer, since, even though these things are possible in numpy, I think that the higher-level pandas is much more natural for them.

In pandas, you could do this in one step, after creating a DataFrame:

df = pd.DataFrame({'ids': ids, 'times': times})

df['diffs'] = df.groupby('ids')['times'].diff()

This gives:

>>> df
   ids  times  diffs
0    2    0.1    NaN
1    0    0.3    NaN
2    1    0.3    NaN
3    0    0.5    0.2
4    1    0.6    0.3
5    1    1.2    0.6
6    2    1.3    1.2
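If you only need the diffs and not the full DataFrame, a small variant is to group a bare Series by the raw ids array (a sketch assuming only the aligned diffs are wanted):

```python
import numpy as np
import pandas as pd

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

# group an anonymous Series by the raw ids array; the result stays aligned
# with the original order, NaN wherever an id appears for the first time
diffs = pd.Series(times).groupby(ids).diff().to_numpy()
```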