Problem description:

I am working on large-scale hierarchical text classification of ODP documents. The dataset I was given is in libSVM format, and I am trying to train a linear-kernel SVM with Python's scikit-learn. Below are two sample lines from the training data:

29 9454:1 11742:1 18884:14 26840:1 35147:1 52782:1 72083:1 73244:1 78945:1 79913:1 79986:1 86710:3 117286:1 139820:1 142458:1 146315:1 151005:2 161454:3 172237:1 1091130:1 1113562:1 1133451:1 1139046:1 1157534:1 1180618:2 1182024:1 1187711:1 1194345:3

33 2474:1 8152:1 19529:2 35038:1 48104:1 59738:1 61854:3 67943:1 74093:1 78945:1 88558:1 90848:1 97087:1 113284:16 118917:1 122375:1 124939:1

The following is the code I used to build the linear SVM model:


from sklearn.datasets import load_svmlight_file
from sklearn import svm

X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)

print clf.score(X_test,y_test)

Upon running clf.score(), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b285fbfb3efe> in <module>()
      1 start_time = time.time()
----> 2 print clf.score(X_test,y_test)
      3 print time.time() - start_time, "seconds"

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
    292         """
    293         from .metrics import accuracy_score
--> 294         return accuracy_score(y, self.predict(X))
    295
    296

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    464             Class labels for samples in X.
    465         """
--> 466         y = super(BaseSVC, self).predict(X)
    467         return self.classes_.take(y.astype(np.int))
    468

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    280         y_pred : array, shape (n_samples,)
    281         """
--> 282         X = self._validate_for_predict(X)
    283         predict = self._sparse_predict if self._sparse else self._dense_predict
    284         return predict(X)

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X)
    402             raise ValueError("X.shape[1] = %d should be equal to %d, "
    403                              "the number of features at training time" %
--> 404                              (n_features, self.shape_fit_[1]))
    405         return X
    406

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

Can someone please tell me what exactly is wrong with either this code or my data? Thanks in advance.

The values of X_train, y_train, X_test, and y_test are shown below:

X_train:

 (0, 9453) 1.0

(0, 11741) 1.0

(0, 18883) 14.0

(0, 26839) 1.0

(0, 35146) 1.0

(0, 52781) 1.0

(0, 72082) 1.0

(0, 73243) 1.0

(0, 78944) 1.0

(0, 79912) 1.0

(0, 79985) 1.0

(0, 86709) 3.0

(0, 117285) 1.0

(0, 139819) 1.0

(0, 142457) 1.0

(0, 146314) 1.0

(0, 151004) 2.0

(0, 161453) 3.0

(0, 172236) 1.0

(0, 187531) 2.0

(0, 202462) 1.0

(0, 210417) 1.0

(0, 250581) 1.0

(0, 251689) 1.0

(0, 296384) 2.0

: :

(4462, 735469) 1.0

(4462, 737059) 15.0

(4462, 740127) 1.0

(4462, 743798) 1.0

(4462, 766063) 1.0

(4462, 778958) 2.0

(4462, 784004) 4.0

(4462, 837264) 2.0

(4462, 839095) 22.0

(4462, 844735) 6.0

(4462, 859721) 2.0

(4462, 875267) 1.0

(4462, 910761) 1.0

(4462, 931244) 1.0

(4462, 945069) 6.0

(4462, 948728) 1.0

(4462, 948850) 2.0

(4462, 957682) 1.0

(4462, 975170) 1.0

(4462, 989192) 1.0

(4462, 1014294) 1.0

(4462, 1042424) 1.0

(4462, 1049027) 1.0

(4462, 1072931) 1.0

(4462, 1145790) 1.0

y_train:

[ 2.90000000e+01 3.30000000e+01 3.30000000e+01 ..., 1.65475000e+05
  1.65518000e+05 1.65518000e+05]

X_test:

 (0, 18573) 1.0

(0, 23501) 1.0

(0, 29954) 1.0

(0, 42112) 1.0

(0, 46402) 1.0

(0, 63041) 2.0

(0, 67942) 2.0

(0, 83522) 1.0

(0, 88413) 2.0

(0, 99454) 1.0

(0, 126041) 1.0

(0, 139819) 1.0

(0, 142678) 1.0

(0, 151004) 1.0

(0, 166351) 2.0

(0, 173794) 1.0

(0, 192162) 3.0

(0, 210417) 2.0

(0, 254468) 1.0

(0, 263895) 2.0

(0, 277567) 1.0

(0, 278419) 2.0

(0, 279181) 2.0

(0, 281319) 2.0

(0, 298898) 1.0

: :

(1857, 1100504) 3.0

(1857, 1103247) 1.0

(1857, 1105578) 1.0

(1857, 1108986) 2.0

(1857, 1118486) 1.0

(1857, 1120807) 9.0

(1857, 1129243) 2.0

(1857, 1131786) 1.0

(1857, 1134029) 2.0

(1857, 1134410) 5.0

(1857, 1134494) 1.0

(1857, 1139045) 25.0

(1857, 1142239) 3.0

(1857, 1142651) 1.0

(1857, 1144787) 1.0

(1857, 1151891) 1.0

(1857, 1152094) 1.0

(1857, 1157533) 1.0

(1857, 1159376) 1.0

(1857, 1178944) 1.0

(1857, 1181310) 2.0

(1857, 1182023) 1.0

(1857, 1187098) 1.0

(1857, 1194344) 2.0

(1857, 1195819) 9.0

y_test:

[ 2.90000000e+01 3.30000000e+01 1.56000000e+02 ..., 1.65434000e+05
  1.65475000e+05 1.65518000e+05]

Answer:

The error message

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

explains itself: the test data has a different number of features than the training data used to fit the model. That is, X_train.shape[1] is not equal to X_test.shape[1].

You should check why they differ, since they must match.

One possibility is that, because the files are loaded as sparse matrices, load_svmlight_file infers the number of features separately for each file from the highest feature index it finds there. If the test data contains feature indices higher than any seen in the training data, X_test ends up with more columns than X_train. To avoid this, pass the n_features argument to load_svmlight_file so that both matrices get the same width.
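As a rough illustration of that inference, the sketch below writes two tiny libSVM-format files to a temporary directory and loads them with and without n_features. The file contents, paths, and variable names are made up for the example and are not the asker's data.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Hypothetical toy data; feature indices start at 1, as in the question.
# The test file uses feature index 7, which never occurs in training.
train_text = "1 1:1 3:2\n2 2:1 5:4\n"
test_text = "1 1:1 7:3\n"

tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.txt")
test_path = os.path.join(tmp_dir, "test.txt")
with open(train_path, "w") as f:
    f.write(train_text)
with open(test_path, "w") as f:
    f.write(test_text)

# Without n_features, each call infers the width from its own file,
# so the two matrices end up with different numbers of columns.
X_train_raw, _ = load_svmlight_file(train_path)
X_test_raw, _ = load_svmlight_file(test_path)
print(X_train_raw.shape, X_test_raw.shape)

# Passing the larger of the two widths to both calls aligns the columns.
n_features = max(X_train_raw.shape[1], X_test_raw.shape[1])
X_train, y_train = load_svmlight_file(train_path, n_features=n_features)
X_test, y_test = load_svmlight_file(test_path, n_features=n_features)
print(X_train.shape, X_test.shape)
```

Taking the maximum of both widths avoids having to know in advance which file is wider, at the cost of reading each file twice.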

Answer:

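You can use the n_features option. The value must be at least as large as the number of features in the file being loaded, otherwise load_svmlight_file raises a ValueError. In the question the test matrix is the wider one (1199847 columns versus 1199830), so load it first and pass its width when loading the training file:

X_test, y_test = load_svmlight_file("/path-to-file/test.txt")
X_train, y_train = load_svmlight_file("/path-to-file/train.txt", n_features=X_test.shape[1])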

This error can also be solved by using load_svmlight_files, which loads several files at once and gives all of them the same number of features:

from sklearn.datasets import load_svmlight_files
X_train, y_train, X_test, y_test = load_svmlight_files(['/path-to-file/train.txt', '/path-to-file/test.txt'])
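
To see the effect end to end, here is a minimal sketch on made-up data: both files are loaded in one load_svmlight_files call, and the same linear-kernel SVC as in the question is then fitted and scored without the shape error. The toy file contents and temporary paths are assumptions for illustration only.

```python
import os
import tempfile

from sklearn import svm
from sklearn.datasets import load_svmlight_files

# Hypothetical toy data with two classes; the test file contains
# feature indices (6 and 7) that never appear in the training file.
train_text = "1 1:1 3:2\n2 2:1 5:4\n1 1:2 3:1\n2 2:3 5:1\n"
test_text = "1 1:1 7:3\n2 5:2 6:1\n"

tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.txt")
test_path = os.path.join(tmp_dir, "test.txt")
with open(train_path, "w") as f:
    f.write(train_text)
with open(test_path, "w") as f:
    f.write(test_text)

# One call loads both files and gives them a common number of columns.
X_train, y_train, X_test, y_test = load_svmlight_files([train_path, test_path])
print(X_train.shape, X_test.shape)

# Fitting and scoring now work because the feature counts agree.
clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```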