Question:

I'm having difficulty getting the CoreNLP system to correctly find where one sentence ends and another begins in a corpus of poetry.

The reasons why it's struggling:

  • some poems have no punctuation throughout their entire length (and sometimes no case)
  • some poems have sentences that run from one paragraph into another
  • some poems have capitalization at the beginning of every line

This is a particularly tricky one

(The system thought the first sentence ended at the "." at the beginning of the second stanza)

Given the lack of capitals and punctuation to go on, I tried -tokenizeNLs to see whether that improved things, but it went overboard and cut off every sentence that ran across a blank line (there are a few of those).

These sentences often end at the end of a line, but not always. Ideally, the system would treat each line ending as a candidate sentence break and weigh how likely each one is to be a real endpoint, but I don't know how I would implement that.

Is there an elegant way to do this? Or an alternative?

Thanks in advance!

(expected sentence output here)

Answer:

This would be a neat project! I don't think anyone is working on it in our group at the moment, but I see no reason why we wouldn't incorporate a patch if you make one. The biggest challenge I see is that our sentence splitter is currently entirely rule-based, and therefore these sorts of "soft" decisions are relatively hard to incorporate.

A possible solution for your case could be to use a language model's "end of sentence" probabilities (three options, in no particular order: https://kheafield.com/code/kenlm/, https://code.google.com/p/berkeleylm/, http://www.speech.sri.com/projects/srilm/). Line ends with a sufficiently high end-of-sentence probability could then be split off as new sentences.
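To make the idea concrete, here is a rough sketch in Python. It is not CoreNLP code and does not use any of the toolkits above: a tiny add-alpha smoothed count model stands in for a real language model, and the function names, the `alpha` smoothing value, and the `threshold` are all made up for illustration. It only looks at the last word of each line, whereas a real n-gram model would condition on more context.

```python
from collections import Counter


def train_eos_model(sentences):
    """Count how often each word occurs and how often it ends a sentence.

    `sentences` is a list of already-split training sentences (assumed input).
    """
    word_counts = Counter()
    end_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        if not words:
            continue
        word_counts.update(words)
        end_counts[words[-1]] += 1
    return word_counts, end_counts


def eos_probability(word, word_counts, end_counts, alpha=1.0):
    """Add-alpha estimate of P(end of sentence | word).

    Two outcomes are smoothed (sentence ends / sentence continues),
    so an unseen word gets probability 0.5.
    """
    word = word.lower()
    return (end_counts[word] + alpha) / (word_counts[word] + 2 * alpha)


def split_at_line_ends(lines, word_counts, end_counts, threshold=0.6):
    """Treat each line end as a candidate break; split only if likely enough."""
    sentences, current = [], []
    for line in lines:
        words = line.split()
        if not words:
            # A blank line (stanza break) is not a sentence break by itself.
            continue
        current.extend(words)
        if eos_probability(words[-1], word_counts, end_counts) >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Whatever is left over forms the final sentence.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    training = [
        "the river runs to the sea",
        "i walk beside the sea",
        "the night is cold",
        "we sleep in the cold night",
    ]
    counts = train_eos_model(training)
    poem = ["i walk beside the", "sea", "", "the river runs to", "the sea"]
    for sentence in split_at_line_ends(poem, *counts):
        print(sentence)
```

With a real language model you would replace `eos_probability` with the model's own end-of-sentence score for the text seen so far, and tune the threshold on a few poems you have segmented by hand.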
