Problem description:

I am using the PySpark and pandas APIs in Python to analyze sales data. I have a generator_object built over a data_list, where each element of the data_list is an obj. Each obj consists of a key (a string) and

data (the pandas DataFrame for one group produced by groupby). The code is given below; sorry if it seems ambiguous.

1) I need to map each element of the data_list to a calculation function using PySpark so that the needed calculations can be performed on the grouped data. I don't know how to go about this beyond creating the generator_object; the closest thing I have to an approach is sketched after the code below.

2) The calculation function would return a PySpark DataFrame. I need to merge all the DataFrames produced by the parallel computation across the cluster into one final_dataframe. How do I accomplish this?
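For 2), the only pattern I have found so far is to collect the returned DataFrames into a Python list and fold them together with unionByName. A minimal sketch of what I mean (result_dfs is a hypothetical stand-in for the per-group results; all DataFrames are assumed to share the same schema):

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical per-group results; in reality these would come from the
# calculation function, one DataFrame per unique key
result_dfs = [
    spark.createDataFrame([(1, 100.0)], ['sale_no', 'sale_amt']),
    spark.createDataFrame([(2, 250.0)], ['sale_no', 'sale_amt']),
]

# fold the list into one DataFrame; unionByName matches columns by name
final_dataframe = reduce(DataFrame.unionByName, result_dfs)

Is this the right way to do the merge, or is there a better approach when the number of groups is large?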

Details: I have a pandas DataFrame with 8 columns: country, state, city, zone, date, sale_no, sale_amt, discount.

I grouped it by the first four columns to generate a unique key for each zone.

columns = ['L1', 'L2', 'L3', 'L4']  # the four grouping levels (country, state, city, zone)
grouped_df_pandas = df_pandas.groupby(columns, sort=False)

data_list = []

# creating data_list
for name, group in grouped_df_pandas:
    # setting the key for each group of data (unique combination of the 4 levels)
    key = ['L' + str(name[0]), 'L' + str(name[1]),
           'L' + str(name[2]), 'L' + str(name[3])]
    # build a fresh obj per group; reusing a single dict would leave every
    # entry in data_list pointing at the same group
    obj = {'id': key, 'data': group}
    data_list.append(obj)

# creating generator_object
generator_obj = (each_obj for each_obj in data_list)

# calling the calculation function with generator_obj
# the objects returned above have to be merged to create the final PySpark df
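The closest I have come for 1) is to skip the generator, distribute data_list itself as an RDD, and map a calculation function over it. This is only a sketch: calculate_sales is a hypothetical placeholder for the real calculation, and it returns a pandas DataFrame rather than a PySpark one, since as far as I understand a Spark DataFrame cannot be created inside an executor task.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def calculate_sales(obj):
    # hypothetical per-group calculation on the pandas group in obj['data']
    group = obj['data']
    return pd.DataFrame({
        'id': ['-'.join(obj['id'])],
        'total_sale_amt': [group['sale_amt'].sum()],
        'total_discount': [group['discount'].sum()],
    })

# each task works on exactly one obj, i.e. one unique key
results = sc.parallelize(data_list).map(calculate_sales).collect()

# combine the collected pandas results into a single PySpark DataFrame
final_dataframe = spark.createDataFrame(pd.concat(results, ignore_index=True))

I am not sure whether shipping whole pandas groups to the executors like this is reasonable, which is part of what I am asking.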

1) How do I pass the generator_obj to a function in PySpark in such a way that, at any one time, the function is working (in parallel) on one obj, i.e. the DataFrame associated with one unique key?
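Or would it be better to drop the pandas groupby and the generator entirely and let Spark do the grouping itself? Something like the following sketch, which assumes Spark 3.0+ for applyInPandas; calc_per_zone, the output columns, and the schema types are hypothetical placeholders for my real calculation.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# load the raw sales data as a Spark DataFrame instead of a pandas one
df_spark = spark.createDataFrame(df_pandas)

def calc_per_zone(pdf):
    # hypothetical calculation; pdf holds one group's rows as a pandas DataFrame
    return pd.DataFrame({
        'country': [pdf['country'].iloc[0]],
        'state': [pdf['state'].iloc[0]],
        'city': [pdf['city'].iloc[0]],
        'zone': [pdf['zone'].iloc[0]],
        'total_sale_amt': [float(pdf['sale_amt'].sum())],
    })

# Spark runs calc_per_zone once per unique (country, state, city, zone)
# group, in parallel across the cluster, and returns one combined DataFrame
final_dataframe = (
    df_spark
    .groupBy('country', 'state', 'city', 'zone')
    .applyInPandas(
        calc_per_zone,
        schema='country string, state string, city string, '
               'zone string, total_sale_amt double',
    )
)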
