Problem description:

I have a dataset comprising various values related to auto_sales in the USA.

I'm trying to predict the auto_sales for October 2015 using a simple OLS regression.

df2 = pd.read_csv('Paul_data/question12_prediction_data.csv')

window_size = 7 #-1 due to zero-indexing of array

window = df2.ix[0:window_size,:]

print window

result = sm.ols(formula="log_sales ~ log_sales_l2 + vehicleshopping_l2 + vehiclebrand_l2 + actual_sales_edmunds_l1 + isSummer + isWinter", data=df2).fit()

print result.predict()[df2[(df2.month == 10) & (df2.year == 2015)].index[0]]

window is the following data:

   year  month  auto_sales  log_sales  log_sales_l1  log_sales_l2  \
0  2015      3       83352  11.330828     11.294807     11.317823
1  2015      4       83871  11.337035     11.330828     11.294807
2  2015      5       85489  11.356143     11.337035     11.330828
3  2015      6       84123  11.340035     11.356143     11.337035
4  2015      7       85320  11.354164     11.340035     11.356143
5  2015      8         NaN        NaN     11.354164     11.340035
6  2015      9         NaN        NaN           NaN     11.354164
7  2015     10         NaN        NaN           NaN           NaN

   log_sales_l3  GT_vehicleshopping  GT_vehiclemaintenance  GT_suvs  \
0     11.313523              0.1320                  0.694   0.0680
1     11.317823              0.1150                  0.745   0.0525
2     11.294807              0.1060                  0.754   0.0560
3     11.330828              0.0950                  0.785   0.0550
4     11.337035              0.1025                  0.870   0.1075
5     11.356143              0.1140                  0.794   0.1240
6     11.340035                 NaN                    NaN      NaN
7           NaN                 NaN                    NaN      NaN

   ...  vansminivans_l2  isWinter  isSummer  vehiclebrands  \
0  ...           0.0900         1         0           0.08
1  ...           0.1250         0         0           0.09
2  ...           0.1580         0         0           0.09
3  ...           0.1750         0         1           0.12
4  ...           0.1920         0         1           0.17
5  ...           0.2100         0         1            NaN
6  ...           0.2175         0         0            NaN
7  ...              NaN       NaN       NaN            NaN

   vehiclebrand_l1  vehiclebrand_l2  actual_sales_edmunds  edmund_forecast  \
0             0.05             0.03               1542841          1522881
1             0.08             0.05               1451790          1464176
2             0.09             0.08               1631234          1591221
3             0.09             0.09               1473142          1484487
4             0.12             0.09               1507643          1478025
5             0.17             0.12               1573573          1538958
6              NaN             0.17                   NaN              NaN
7              NaN              NaN                   NaN              NaN

   actual_sales_edmunds_l1  edmund_forecast_l1
0                  1255458             1285019
1                  1542841             1522881
2                  1451790             1464176
3                  1631234             1591221
4                  1473142             1484487
5                  1507643             1478025
6                  1573573             1538958
7                      NaN                 NaN

[8 rows x 32 columns]

However I get the following error:


IndexError Traceback (most recent call last)
<ipython-input-83-16bf72335e7f> in <module>()
      5
      6 result = sm.ols(formula="log_sales ~ log_sales_l2 + vehicleshopping_l2 + vehiclebrand_l2 + actual_sales_edmunds_l1 + isSummer + isWinter", data=df2).fit()
----> 7 print result.predict()[df2[(df2.month == 10) & (df2.year == 2015)].index[0]]
      8 #np.exp(result.predict(df2.ix[x+(window_size)]))

IndexError: index 7 is out of bounds for axis 0 with size 5

I'm not sure how to proceed at this point. I understand that I am trying to do out-of-sample prediction, but everything I've tried so far has failed to solve the issue.

Answer:

I believe the problem is that the data you are regressing on has only 5 rows in which none of the model's inputs is NaN, so only those 5 rows are used in the fit. Therefore this:

result.predict()

returns an array of 5 elements, but this:

df2[(df2.month == 10) & (df2.year == 2015)].index[0]

returns 7, because the filter selects a single row, which is the 8th row of your original dataframe. You are in effect asking for the 8th element of an array of length 5, which raises the IndexError.
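The length mismatch described above can be reproduced on toy data. This is only a minimal sketch: the `toy` dataframe and its columns `x` and `y` are made up, `dropna()` stands in for the formula interface dropping rows with missing values, and `np.linalg.lstsq` stands in for `sm.ols`.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 8 rows, but the response is missing in the last 3.
toy = pd.DataFrame({
    "month": [3, 4, 5, 6, 7, 8, 9, 10],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8, np.nan, np.nan, np.nan],
})

# Stand-in for the missing-value handling: drop every row that contains a NaN.
complete = toy.dropna()

# Ordinary least squares on the complete rows only, y = a + b * x.
X = np.column_stack([np.ones(len(complete)), complete["x"].to_numpy()])
coef, *_ = np.linalg.lstsq(X, complete["y"].to_numpy(), rcond=None)
fitted = X @ coef  # in-sample fitted values, one per row used in the fit

# Row label of the month-10 observation in the full dataframe.
target_label = toy[toy.month == 10].index[0]

print(len(fitted))    # number of rows used in the fit
print(target_label)   # label of the row we wanted a prediction for

caught = False
try:
    fitted[target_label]  # positional lookup with a label from the full frame
except IndexError as err:
    caught = True
    print("IndexError:", err)
```

The row label comes from the full dataframe, while the fitted values only cover the rows that survived the NaN filter, so the two cannot be used together directly.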

Answer:

user333700 was correct; this solved my problem:

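import pandas as pd
import statsmodels.formula.api as sm  # assumed alias; the post does not show its imports

df2 = pd.read_csv('Paul_data/question12_prediction_data.csv')
window_size = 4  # last row label of the training window; label slices include the end, so rows 0-4
window = df2.loc[0:window_size, :]

# Fit on the window only, then predict over the full dataframe.
result = sm.ols(formula="log_sales ~ log_sales_l2 + vehicleshopping_l2 + vehiclebrand_l2 + actual_sales_edmunds_l1 + isSummer + isWinter", data=window).fit()
# Row label of the October 2015 observation, shifted down by 1.
index = df2[(df2.month == 10) & (df2.year == 2015)].index[0] - 1
print(result.predict(df2)[index])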
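
One way to avoid manual index offsets is to keep the predictions aligned with the dataframe's row labels. The following is a rough sketch under assumptions, not the method used above: the `toy` dataframe, the column `x_l1`, and the helper `design` are hypothetical, and `np.linalg.lstsq` again stands in for `sm.ols`.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the lagged predictor is known for every row,
# the response is unknown for the last two rows.
toy = pd.DataFrame({
    "month": [6, 7, 8, 9, 10],
    "x_l1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, 7.0, np.nan, np.nan],
})

# Train only on rows where the response is observed.
train = toy.dropna(subset=["y"])


def design(frame):
    """Hypothetical helper: design matrix with an intercept column."""
    return np.column_stack([np.ones(len(frame)), frame["x_l1"].to_numpy()])


coef, *_ = np.linalg.lstsq(design(train), train["y"].to_numpy(), rcond=None)

# Predict for every row and keep the original row labels as the index,
# so a label taken from the full dataframe selects the matching prediction.
predictions = pd.Series(design(toy) @ coef, index=toy.index)

target_label = toy[toy.month == 10].index[0]
print(predictions.loc[target_label])
```

This only yields a number for rows whose predictors are known; a row with NaN predictors produces a NaN prediction. In the window printed in the question, the October 2015 row has NaN in log_sales_l2, vehiclebrand_l2, and actual_sales_edmunds_l1.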