问题描述:

I have data from a telematics device with a driver ID, a timestamp and some sensor data which are not relevant for my example.

I want to create a route identifier and a record counter from this data in order to calculate statistics for each route for each driver. We are using pyspark 1.2.1 on a YARN cluster from the Hortonworks HDP 2.2.6.0 platform.

My data looks like this, it is in a pair RDD and these two elements are the key:

| driverID | timestamp |

| D1 | 1 |

| D1 | 2 |

| D1 | 6 |

| D1 | 8 |

| D2 | 1 |

| D2 | 3 |

| D2 | 4 |

| D2 | 7 |

I want to get the runID and sequenceID column, with the hypothesis that a lag of 3 time units starts a new run

| driverID | timestamp | runID | sequenceID

| D1 | 1 | 1 |1

| D1 | 2 | 1 |2

| D1 | 6 | 2 |1

| D1 | 8 | 2 |2

| D2 | 1 | 1 |1

| D2 | 3 | 1 |2

| D2 | 4 | 1 |3

| D2 | 7 | 2 |1

What would you suggest me to do ? This will be used ultimately with a terabyte size dataset. Driver ID is a string and timestamp is actually a datetime object.

thanks for your help

相关阅读:
Top