问题描述:

I'm new to Dataflow, so this is probably an easy question.

I want to try out the Sessions windowing strategy. According to the windowing documentation, windowing is not applied until we've done a GroupByKey, so I'm trying to do that.

However, when I look at my pipeline in Google Cloud Platform, I can see that MapElements returns elements, but no elements are returned by GroupByKey ("Elements Added: -"). What am I doing wrong when grouping by key?

Here's the most relevant part of the code:

events = events

.apply(Window.named("eventsSessionsWindowing")

.<MyEvent>into(Sessions.withGapDuration(Duration.standardSeconds(3)))

);

PCollection<KV<String, MyEvent>> eventsKV = events

.apply(MapElements

.via((MyEvent e) -> KV.of(ExtractKey(e), e))

.withOutputType(new TypeDescriptor<KV<String, MyEvent>>() {}));

PCollection<KV<String, Iterable<MyEvent>>> eventsGrouped = eventsKV.apply(GroupByKey.<String, MyEvent>create());

网友答案:

A GroupByKey fires according to a triggering strategy, which determines when the system thinks that all data for this key/window has been received and it's time to group it and pass to downstream transforms. The default strategy is:

The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window.

Please see Default Trigger for details. You were seeing a delay of a couple of minutes that corresponded to the progression of PubSub's watermark.

相关阅读:
Top