Problem description:

I am trying to build a baseline MT system. Just to check how it works, I made source (S) and target (T) language corpora of just 2000 sentences. The very first step is to prepare the data for the machine translation (MT) system. In this step we first have to perform tokenization, as described in the Baseline SMT guide. I used these commands:

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < ~/corpus/training/news-commentary-v8.fr-en.en \
    > ~/corpus/news-commentary-v8.fr-en.tok.en

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
    < ~/corpus/training/news-commentary-v8.fr-en.fr \
    > ~/corpus/news-commentary-v8.fr-en.tok.fr

(Say S = French and T = English.)

When I checked after 2 hours, it was still running. That made me curious, since I did not expect tokenization to take that long. I then tried with just ten sentences. To my surprise, 30 minutes later it is still running.
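
For reference, the ten-sentence test can be reproduced roughly as follows. This is only a minimal sketch: the sample file names under /tmp, the placeholder fallback, and the 60-second limit are my own assumptions and not part of the Baseline SMT instructions. `timeout` is the GNU coreutils command, which exits with status 124 if the time limit is reached.

```shell
#!/bin/sh
# Paths taken from the commands above; override via the environment if yours differ.
TOKENIZER="${TOKENIZER:-$HOME/mosesdecoder/scripts/tokenizer/tokenizer.perl}"
CORPUS="${CORPUS:-$HOME/corpus/training/news-commentary-v8.fr-en.en}"
SAMPLE="${SAMPLE:-/tmp/sample10.en}"       # hypothetical file name
OUTPUT="${OUTPUT:-/tmp/sample10.tok.en}"   # hypothetical file name

# Take the first ten sentences of the corpus. If the corpus is not readable,
# fall back to ten placeholder lines (an assumption, only so the sketch runs anywhere).
if [ -r "$CORPUS" ]; then
    head -n 10 "$CORPUS" > "$SAMPLE"
else
    for i in 1 2 3 4 5 6 7 8 9 10; do
        echo "This is placeholder sentence $i."
    done > "$SAMPLE"
fi

# Run the tokenizer with stdin and stdout redirected explicitly,
# and stop after 60 seconds instead of waiting indefinitely.
if [ -f "$TOKENIZER" ]; then
    timeout 60 perl "$TOKENIZER" -l en < "$SAMPLE" > "$OUTPUT"
    echo "tokenizer exit status: $?"
else
    echo "tokenizer not found at $TOKENIZER" >&2
fi

# Report how many sentences were in the sample.
wc -l < "$SAMPLE"
```

An exit status of 0 with a non-empty output file would mean the tokenizer itself finishes on a small input; a status of 124 would mean it was still running when the limit was hit.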

Did I do anything wrong?

PS: OS = Ubuntu 14.04.5 LTS, Sony ultrabook, no dual boot.
