Question:

I have a text that I split into sentences and words. Next I need to extract the punctuation tokens (",", ".", "?", "!", "...") from it, and this is where I am stuck. Which regex should I use?

This is my code, which splits the text into sentences and words:

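String s = ReadFromFile();

// Split into sentences on ., ! or ? followed by optional whitespace
String[] sentences = s.split("[.!?]\\s*");

// Split each sentence into words on runs of punctuation and whitespace
String[][] words = new String[sentences.length][];
for (int i = 0; i < sentences.length; ++i) {
    words[i] = sentences[i].split("[\\p{Punct}\\s]+");
}

System.out.println(Arrays.deepToString(words));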

So I have a separate array of sentences and an array of words, but I don't know how to get the tokens.

Input data

Arithmetic operators are used in mathematical expressions in the same way that they are used in algebra. The following table lists the arithmetic operators:

Assume integer variable A holds 10 and variable B holds 20, then:

Expected result

. : , :

Answer:

The simplest solution is not to use split, which requires you to describe what you don't want in the result. Instead, use Matcher#find and describe what you do want to find.

String s = "Arithmetic operators are used in mathematical expressions in the same way that they are used in algebra. The following table lists the arithmetic operators: Assume integer variable A holds 10 and variable B holds 20, then:";

Pattern p = Pattern.compile("\\p{Punct}");
       //or Pattern.compile("[.]{3}|\\p{Punct}"); if you want to find "..."
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}

Output:

.
:
,
:

Instead of printing m.group(), you can store each match in a collection such as a List.
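
For illustration, collecting the matches into a List could look roughly like the sketch below. The class name PunctuationTokens and the helper findPunctuation are made up for this example and are not part of the original answer. It uses the variant pattern that also treats "..." as a single token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctuationTokens {
    // Try "..." first so an ellipsis is kept as one token,
    // then fall back to any single ASCII punctuation character.
    private static final Pattern PUNCT = Pattern.compile("[.]{3}|\\p{Punct}");

    // Hypothetical helper: returns every punctuation token in order of appearance.
    static List<String> findPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PUNCT.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "Arithmetic operators are used in mathematical expressions "
                 + "in the same way that they are used in algebra. "
                 + "The following table lists the arithmetic operators: "
                 + "Assume integer variable A holds 10 and variable B holds 20, then:";
        System.out.println(findPunctuation(s)); // prints [., :, ,, :]
    }
}
```

Note that \p{Punct} in java.util.regex covers only ASCII punctuation by default, so characters such as typographic quotes would not be matched by this sketch.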
