Here the task is classification of videos.
One sample is a single frame.
A group of samples is consecutive frames of a video.
I have a neural network that gives me an accuracy of 65% on a single sample (frame).
Suppose I use the output of the second-to-last layer (a dense layer with 4096 units) as a feature extractor and train an LSTM on groups of these features.
Would the prediction for a group of samples always be better than the single-sample approach?
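To make the setup concrete, here is a minimal sketch of how the per-frame features of one video could be cut into groups of consecutive frames before they are fed to an LSTM. The function name `make_groups`, the `group_len` and `stride` parameters, and the toy 4-dimensional features (standing in for the 4096-dimensional ones) are my own assumptions for illustration, not something taken from the paper.

```python
def make_groups(frame_features, group_len, stride=None):
    """Split the per-frame feature vectors of one video into groups of
    consecutive frames. Each group is one training sequence for the LSTM."""
    if stride is None:
        stride = group_len  # assumption: non-overlapping groups by default
    groups = []
    # Only full-length groups are kept; a trailing partial group is dropped.
    for start in range(0, len(frame_features) - group_len + 1, stride):
        groups.append(frame_features[start:start + group_len])
    return groups


# Toy stand-in for the CNN output: 10 frames, 4 features per frame
# (the real features would have 4096 values per frame).
video = [[float(t)] * 4 for t in range(10)]

groups = make_groups(video, group_len=4, stride=2)
print(len(groups), len(groups[0]), len(groups[0][0]))
```

Each element of `groups` has shape (frames in group, features per frame), and all frames in a group would share the label of the video they come from.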
In this paper they used a CNN as a feature extractor and then used its features to train an LSTM.
The LSTM model on top of the CNN always seems to give an improved result.
The only reason I can think of why the CNN+LSTM would not perform better is:
1) The extracted features do not contain time-dependent information that the LSTM can exploit.
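One rough way to probe this would be to measure how much the features change from one frame to the next within a group. This is only a heuristic sketch under my own assumptions: the function name `mean_consecutive_change` and the toy 3-dimensional features are hypothetical, and a non-zero value does not prove that the change is useful for classification. A value near zero, however, would suggest that the group carries little beyond what a single frame already provides.

```python
import math


def mean_consecutive_change(group):
    """Average Euclidean distance between the feature vectors of
    neighbouring frames in one group."""
    if len(group) < 2:
        return 0.0  # a single frame has no temporal change to measure
    total = 0.0
    for prev, curr in zip(group, group[1:]):
        total += math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    return total / (len(group) - 1)


# Toy groups: identical features in every frame vs. features that drift over time.
static_group = [[1.0, 2.0, 3.0]] * 5
moving_group = [[float(t), 0.0, 0.0] for t in range(5)]

print(mean_consecutive_change(static_group))
print(mean_consecutive_change(moving_group))
```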