问题描述:

I Have 2 data sets as below on HDFS that are Tab separated:

A

--------

DATE PAGE VIEWS CLICKS

2014/01/21 200 50

2014/01/22 300 70

2014/01/23 150 100

B

--------

DATE PAGE VIEWS CLICKS

2014/01/23 50 25

2014/01/24 250 110

I would like to merge these 2 data sets to produce the following result:

C

--------

DATE PAGE VIEWS CLICKS

2014/01/21 200 50

2014/01/22 300 70

2014/01/23 200 125

2014/01/24 250 110

As you can see, the final data set C, has got the aggregate data for 2014/01/23 from both A & B, and included records from A not in B and vice versa. I know this can be acheived in PIG using a FULL OUTER JOIN but am not sure of a clean solution. In SQL, I could acheive the same as follows:

SELECT

COALESCE(A.DATE, B.DATE) AS DATE,

SUM(COALESCE(A.PAGE_VIEWS, 0)+COALESCE(B.PAGE_VIEWS, 0)) AS PAGE_VIEWS,

SUM(COALESCE(A.CLICKS, 0)+COALESCE(B.CLICKS, 0)) AS CLICKS

FROM A FULL OUTER JOIN B

ON A.DATE = B.DATE

GROUP BY COALESCE(A.DATE, B.DATE);

Would appreciate some guidance on how to implement this in PIG 0.10! Thanks!

网友答案:

One way to do it could be:

 AB1 = UNION A, B;
 AB2 = GROUP AB1 BY DATE;
 C = FOREACH AB2 GENERATE group AS DATE, SUM(PAGE_VIEWS) AS PAGE_VIEWS, SUM(CLICKS) AS CLICKS;
网友答案:

This is a good time to use COGROUP. COGROUP is like GROUP, except you apply the grouping to multiple aliases and as a result you have one bag of records for each of them, instead of just a single bag of records.

Edit:

Unfortunately in this case, because of the way Pig handles bag projection from empty bags, if A or B doesn't have a record for a given DATE, that bag of records will be empty, the projection of a column will result in a NULL bag, the SUM of that is NULL, and then when you add that to the other value, it turns the whole expression into a NULL. So the code has to become a little uglier to account for this.

C =
    FOREACH (COGROUP A BY DATE, B BY DATE)
    GENERATE
        group AS DATE,
        (IsEmpty(A)?0:SUM(A.PAGE_VIEWS))+(IsEmpty(B)?0:SUM(B.PAGE_VIEWS)) AS PAGE_VIEWS,
        (IsEmpty(A)?0:SUM(A.CLICKS))+(IsEmpty(B)?0:SUM(B.CLICKS)) AS CLICKS;

COGROUP is essential, however, if you wish to apply different operations to the different bags, so keep it in mind even if you decide that a UNION/GROUP solution is cleaner in this case.

相关阅读:
Top