Question:

I’d like to use Java 8 streams to take a stream of strings (for example read from a plain text file) and produce a stream of sentences. I assume sentences can cross line boundaries.

So for example, I want to go from:

"This is the", "first sentence. This is the", "second sentence."

to:

"This is the first sentence.", "This is the second sentence."

I can see that it’s possible to get a stream of parts of sentences as follows:

Pattern p = Pattern.compile("\\.");
Stream<String> lines
    = Stream.of("This is the", "first sentence. This is the", "second sentence.");
Stream<String> result = lines.flatMap(s -> p.splitAsStream(s));

But then I’m not sure how to produce a stream to join the fragments into sentences. I want to do this in a lazy way so that only what is needed from the original stream is read. Any ideas?

Answer:

Breaking text into sentences is not as easy as just looking for dots. For example, you don’t want to split in the middle of “Mr. Smith”…

Thankfully, there is already a JRE class that takes care of that: BreakIterator. What it lacks is Stream support, so using it with streams requires some support code around it:

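import java.nio.CharBuffer;
import java.text.BreakIterator;
import java.util.Locale;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class SentenceStream extends Spliterators.AbstractSpliterator<String>
implements Consumer<CharSequence> {

    public static Stream<String> sentences(Stream<? extends CharSequence> s) {
        return StreamSupport.stream(new SentenceStream(s.spliterator()), false);
    }

    Spliterator<? extends CharSequence> source;
    CharBuffer buffer;      // text pulled from the source but not yet emitted
    BreakIterator iterator;

    public SentenceStream(Spliterator<? extends CharSequence> source) {
        super(Long.MAX_VALUE, ORDERED|NONNULL);
        this.source = source;
        iterator=BreakIterator.getSentenceInstance(Locale.ENGLISH);
        buffer=CharBuffer.allocate(100);
        buffer.flip();      // start in read mode with nothing buffered
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        for(;;) {
            int next=iterator.next();
            // A boundary at the very end of the buffered text is not trusted yet:
            // the sentence may continue in the next chunk of input.
            if(next!=BreakIterator.DONE && next!=buffer.limit()) {
                // next is an index into the text given to setText,
                // while subSequence is relative to the current position.
                action.accept(buffer.subSequence(0, next-buffer.position()).toString());
                buffer.position(next);
                return true;
            }
            // Pull one more element from the source; accept() appends it to the buffer.
            if(!source.tryAdvance(this)) {
                // Source exhausted: emit whatever is left as the last sentence.
                if(buffer.hasRemaining()) {
                    action.accept(buffer.toString());
                    buffer.position(0).limit(0);
                    return true;
                }
                return false;
            }
            // Rescan the unread text, which now includes the new chunk.
            iterator.setText(buffer.toString());
        }
    }

    @Override
    public void accept(CharSequence t) {
        // Switch to write mode, moving the unread characters to the front.
        buffer.compact();
        if(buffer.remaining()<t.length()) {
            // Not enough room: copy the unread characters into a larger buffer.
            CharBuffer bigger=CharBuffer.allocate(
                Math.max(buffer.capacity()*2, buffer.position()+t.length()));
            buffer.flip();
            bigger.put(buffer);
            buffer=bigger;
        }
        // Append the new chunk and switch back to read mode.
        buffer.append(t).flip();
    }
}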

With that support class, you can simply say, e.g.:

Stream<String> lines = Stream.of(
    "This is the ", "first sentence. This is the ", "second sentence.");
SentenceStream.sentences(lines).forEachOrdered(System.out::println);
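
For comparison, here is a minimal sketch of BreakIterator used directly on text that is already fully in memory, which is the loop the spliterator above has to spread across several chunks of input. The names BreakIteratorDemo and splitSentences are made up for this illustration, and unlike the spliterator this version is not lazy.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorDemo {

    // Hypothetical helper: splits a complete text into sentences
    // using the JRE's sentence BreakIterator.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // A segment includes the whitespace that follows the sentence, so trim it.
            sentences.add(text.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        splitSentences("This is the first sentence. This is the second sentence.")
            .forEach(System.out::println);
    }
}
```

Because the iterator needs the text it scans to be set up front, the streaming version has to buffer input and call setText again whenever a new chunk arrives.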

Answer:

This is a sequential, stateful problem, a kind that the designers of the Stream API are not too fond of.

In a more general sense, you are implementing a lexer, which converts a sequence of tokens to a sequence of another type of tokens. While you might use Stream to solve it with tricks and hacks, there is really no reason to. Just because Stream is there doesn't mean we have to use it for everything.

That being said, an answer to your question is to use flatMap() with a stateful function that holds intermediary data and emits the whole sentence when a dot is encountered. There is also the issue of EOF - you'll need a sentinel value for EOF in the source stream so that the function can react to it.
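
A rough sketch of that idea follows, under several assumptions: the EOF sentinel, the class name StatefulSentences, and the choice to join lines with a single space are all made up for illustration, and sentences are ended by a plain dot as in the question. The lambda keeps state in a StringBuilder, which goes against the Stream API's advice to use stateless functions, so this is only safe on a sequential stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;

final class StatefulSentences {

    // Hypothetical end-of-input marker; a distinct instance compared by identity.
    static final String EOF = new String("<EOF>");

    static Stream<String> sentences(Stream<String> lines) {
        StringBuilder pending = new StringBuilder(); // text not yet emitted
        Function<String, Stream<String>> joiner = line -> {
            List<String> out = new ArrayList<>();
            if (line == EOF) {
                // End of input: flush whatever is left, if anything.
                String rest = pending.toString().trim();
                pending.setLength(0);
                if (!rest.isEmpty()) out.add(rest);
                return out.stream();
            }
            if (pending.length() > 0) pending.append(' ');
            pending.append(line);
            int dot;
            // Emit every complete sentence currently held in the buffer.
            while ((dot = pending.indexOf(".")) >= 0) {
                out.add(pending.substring(0, dot + 1).trim());
                pending.delete(0, dot + 1);
            }
            return out.stream();
        };
        // Append the sentinel so the function can react to the end of the source.
        return Stream.concat(lines, Stream.of(EOF)).sequential().flatMap(joiner);
    }

    public static void main(String[] args) {
        sentences(Stream.of("This is the", "first sentence. This is the", "second sentence."))
            .forEach(System.out::println);
    }
}
```

The sentinel is needed because flatMap gives the function no other signal that the source has ended, so without it a trailing fragment with no final dot would be lost.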

Answer:

My StreamEx library has a collapse method designed for such tasks. First, let's change your regexp to a look-behind one so that the ending dots are kept and we can use them later:

StreamEx.of(input).flatMap(Pattern.compile("(?<=\\.)")::splitAsStream)

Here the input can be an array, a list, a JDK stream, or just comma-separated strings.

Next, we collapse two adjacent strings if the first one does not end with a dot. The merging function joins both parts into a single string, adding a space between them:

.collapse((a, b) -> !a.endsWith("."), (a, b) -> a + ' ' + b)

Finally, we trim any leading and trailing spaces:

.map(String::trim);

The whole code is here:

List<String> lines = Arrays.asList("This is the", "first sentence.  This is the",
    "second sentence. Third sentence. Fourth", "sentence. Fifth sentence.", "The last");
Stream<String> stream = StreamEx.of(lines)
        .flatMap(Pattern.compile("(?<=\\.)")::splitAsStream)
        .collapse((a, b) -> !a.endsWith("."), (a, b) -> a + ' ' + b)
        .map(String::trim);
stream.forEach(System.out::println);

The output is the following:

This is the first sentence.
This is the second sentence.
Third sentence.
Fourth sentence.
Fifth sentence.
The last

Update: since StreamEx 0.3.4, you can safely do the same with a parallel stream.
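
For readers without StreamEx on the classpath, the collapsing step can be approximated in plain Java. The sketch below is an eager illustration over a List, not StreamEx's implementation: collapseAdjacent and CollapseSketch are hypothetical names, and the result is neither lazy nor parallel-friendly. It assumes the predicate is tested on each pair of adjacent input elements, with the merger accumulating the run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.BinaryOperator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class CollapseSketch {

    // Hypothetical helper: merges each run of adjacent elements for which
    // collapsible(previous, current) is true into a single element.
    static <T> List<T> collapseAdjacent(List<T> input,
                                        BiPredicate<T, T> collapsible,
                                        BinaryOperator<T> merger) {
        List<T> out = new ArrayList<>();
        T prev = null; // previous input element, only read once out is non-empty
        for (T cur : input) {
            int last = out.size() - 1;
            if (last >= 0 && collapsible.test(prev, cur)) {
                out.set(last, merger.apply(out.get(last), cur));
            } else {
                out.add(cur);
            }
            prev = cur;
        }
        return out;
    }

    static List<String> sentences(List<String> lines) {
        // Split after each dot, keeping the dot at the end of the fragment.
        Pattern afterDot = Pattern.compile("(?<=\\.)");
        List<String> fragments = lines.stream()
                .flatMap(afterDot::splitAsStream)
                .collect(Collectors.toList());
        return collapseAdjacent(fragments, (a, b) -> !a.endsWith("."), (a, b) -> a + ' ' + b)
                .stream()
                .map(String::trim)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        sentences(Arrays.asList("This is the", "first sentence.  This is the",
                "second sentence. Third sentence. Fourth", "sentence. Fifth sentence.", "The last"))
            .forEach(System.out::println);
    }
}
```

Collecting the fragments into a list first gives up the laziness asked for in the question; the sketch is only meant to make the predicate and merger roles concrete.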
