Problem description:

I have parsed a text file and pulled out the relevant data. I then combined the variables (dlOrbit2, imageId3, imageStart4, imageEnd4) to create a list of 4 strings for each record.

combined = str(','.join([dlOrbit2, imageId3, imageStart4, imageEnd4]))

strSplit = combined.split(',')

print(strSplit)

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']

['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37']

['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']

['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']

['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']

['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']

['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']

['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53']

['46290', '514628', '2016-10-26 13:12:54', '2016-10-26 13:13:13']

['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

I would like to match and group the rows by the first column, giving 46284 x 4, 46288 x 6, 46290 x 2, 46291 x 4. Within each group I would like the earliest time from element 2 and the latest time from element 3. The desired output would be:

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']

['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:54:57']

['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:13:13']

['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

Each list will always have 4 elements; however, the number of grouping elements (first column) will vary.

I am going to export these results into a CSV file. However, I only need help with the above section.

Answer:

Use pandas:

import pandas as pd

dat = [['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

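# Drop exact duplicates, then take the earliest start (column 2)
# and latest end (column 3) for each orbit (column 0).
df = pd.DataFrame(dat).drop_duplicates()
df_times = df.groupby([0]).agg({2: 'min', 3: 'max'}).reset_index()
# Merge back on orbit and start time to recover column 1;
# '3_x' is the aggregated end time from df_times.
df_times.merge(df, on=[0, 2])[[0, 1, 2, '3_x']]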

Output:

0   46284   514607  2016-10-26 02:43:46 2016-10-26 02:48:39
1   46288   514626  2016-10-26 09:48:26 2016-10-26 09:54:57
2   46290   514628  2016-10-26 13:12:34 2016-10-26 13:13:13
3   46291   514738  2016-10-26 14:56:39 2016-10-26 14:59:06
Answer:

As a newcomer to Python myself, I would like to see examples using base Python functionality before reaching for Big Hammers.

If it can be done without module imports in fewer than a dozen lines of code, I would expect to learn that first.

Perhaps manipulating lists of lists with double indexing wasn't understood?

combined = [['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'], ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

combined[0][0]    # double index
Out[28]: '46284'

combined[2][2:]   # slice
Out[29]: ['2016-10-26 02:43:46', '2016-10-26 02:48:39']

max(combined[2][2:])    # strings compare lexicographically; this timestamp format sorts chronologically
Out[30]: '2016-10-26 02:48:39'
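
The max() call above works on plain strings because the timestamps are zero-padded and ordered from year down to second, so alphabetical order matches chronological order. A minimal sketch that checks this against parsed datetime values; the format string FMT is an assumption inferred from the sample rows, and the three timestamps are taken from the question's data:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed format, inferred from the sample rows

times = ["2016-10-26 09:53:46", "2016-10-26 09:48:26", "2016-10-26 13:12:34"]

# Plain string comparison
earliest_str = min(times)
latest_str = max(times)

# Comparison on parsed datetime objects
earliest_dt = min(times, key=lambda s: datetime.strptime(s, FMT))
latest_dt = max(times, key=lambda s: datetime.strptime(s, FMT))

# Both approaches should pick the same strings
print(earliest_str == earliest_dt, latest_str == latest_dt)
```

If the timestamps were not zero-padded or used a different field order, parsing them first would be the safer route.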

And why not define a function that uses these basic Python tools on the input lists before the grouping?
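
One possible shape for such a function, as a sketch only: the name summarize and the choice to keep the image id of the first row seen in each group are assumptions (the latter is based on the desired output in the question). It uses a dict plus min and max on the timestamp strings, with no imports.

```python
def summarize(rows):
    """Collapse rows sharing the same first column into one row each.

    Keeps the image id of the first row seen per group, the earliest
    start time (index 2) and the latest end time (index 3).
    """
    order = []    # group keys in first-seen order
    groups = {}   # key -> [key, image_id, start, end]
    for key, image_id, start, end in rows:
        if key not in groups:
            groups[key] = [key, image_id, start, end]
            order.append(key)
        else:
            current = groups[key]
            current[2] = min(current[2], start)
            current[3] = max(current[3], end)
    return [groups[key] for key in order]


# A few rows taken from the data above
sample = [
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
]
result = summarize(sample)
for row in result:
    print(row)
```

Because the rows are collected in a dict, this version does not need rows of the same group to be next to each other in the input.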

Answer:

You could leverage groupby and tee:

data = [
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
]


from itertools import groupby, tee
import pprint

res = []
for k, g in groupby(data, key=lambda x: x[0]):
    it1, it2, it3 = tee(g, 3)
    res.append(next(it1)[:2] + [min(x[2] for x in it2), max(x[3] for x in it3)])

pprint.pprint(res)

Output:

[['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
 ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:54:57'],
 ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:13:13'],
 ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

for k, g in groupby(data, key=lambda x: x[0]) groups consecutive rows that share the same first column. It yields tuples in which the first item is the key used for grouping and the second is an iterator over the items in that group.
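
Since groupby only merges consecutive rows, input where the same key appears in separate runs needs sorting first. A small sketch of the difference, using made-up rows (the 'a'/'b' keys are illustrative and not from the question's data):

```python
from itertools import groupby

rows = [['a', 1], ['b', 2], ['a', 3]]   # 'a' appears in two separate runs


def first(row):
    return row[0]


# Without sorting, each run of equal keys becomes its own group
unsorted_keys = [k for k, _ in groupby(rows, key=first)]

# Sorting by the same key first brings equal keys together
sorted_keys = [k for k, _ in groupby(sorted(rows, key=first), key=first)]

print(unsorted_keys)  # ['a', 'b', 'a']
print(sorted_keys)    # ['a', 'b']
```

The data in the question is already ordered by the first column, so no sort is needed there.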

it1, it2, it3 = tee(g, 3) splits the group iterator into three independent iterators, each of which returns exactly the same items. Finally, the result is constructed by taking the first two columns from the first item in the group and running min and max over the other two iterators.
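
As a variation on the same idea (not the approach used above), each group can be turned into a list once and scanned as often as needed, which avoids tee. This is a sketch on a subset of the rows; the variable names are illustrative.

```python
from itertools import groupby

data = [
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
]

res = []
for key, group in groupby(data, key=lambda row: row[0]):
    rows = list(group)  # materialize the group so it can be scanned repeatedly
    earliest = min(r[2] for r in rows)
    latest = max(r[3] for r in rows)
    # First two columns come from the first row of the group
    res.append(rows[0][:2] + [earliest, latest])

for row in res:
    print(row)
```

In this pattern tee ends up buffering the whole group anyway, because min consumes one iterator completely before max starts on the next, so the list version should use a similar amount of memory.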
