Problem description:

I hope you are all fine,

I have a requirement to save a huge quantity of logs into HDInsight (into blobs, and then use Hive to query them from some BI analytics software).

For one day I have something like 30 million .json files. The problem is that the client wants reports at minute level, so that is my granularity.

I was thinking of doing static partitioning like this:

CREATE EXTERNAL TABLE Logs (ex string, exTwo string)
PARTITIONED BY (yearMonthDay STRING, hour INT, minute INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

ALTER TABLE Logs ADD PARTITION (yearMonthDay = '20160901', hour = 12, minute = 51) LOCATION 'wasb://[email protected]/20160901/12/51';
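(I would script these statements; as far as I know Hive lets you add several partitions in one ALTER TABLE, so the generated DDL would look roughly like the sketch below, just extended to every minute of the day.)

-- Rough sketch only; the second location simply follows the same folder layout.
ALTER TABLE Logs ADD
PARTITION (yearMonthDay = '20160901', hour = 12, minute = 51) LOCATION 'wasb://[email protected]/20160901/12/51'
PARTITION (yearMonthDay = '20160901', hour = 12, minute = 52) LOCATION 'wasb://[email protected]/20160901/12/52';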

So for 30 million files a day I would have approximately 30M / 24 hours / 60 minutes ≈ 20,800 .json files in each partition.
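Just to illustrate how the reports would hit this, a per-minute query should then only have to read one partition directory; a rough sketch (the actual columns and aggregation would come from the BI tool):

-- Hypothetical per-minute report; only the hour=12 / minute=51 folder is read.
SELECT ex, count(*) AS events
FROM Logs
WHERE yearMonthDay = '20160901' AND hour = 12 AND minute = 51
GROUP BY ex;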

The problem is that for a single day I would have 1,440 partitions (24 hours x 60 minutes), which is a huge number of partitions!

Since I was not able to find a "must not exceed X partitions" best practice, I really don't know if this approach is valid or not, but I guess the NameNode would not be very happy with it.

I read some articles about bucketed tables and I thought that maybe an approach like this could help me a bit more:

CREATE EXTERNAL TABLE Logs (ex string, exTwo string, minutes int)
PARTITIONED BY (yearMonthDay STRING, hour INT)
CLUSTERED BY (minutes) INTO 60 BUCKETS
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

-- I don't know yet if this statement would work, but I guess it is something like this

set hive.enforce.bucketing = true;

ALTER TABLE Logs ADD PARTITION (yearMonthDay = '20160901', hour = 12) LOCATION 'wasb://[email protected]/20160901/12';

So doing this I would have 24 partitions per day and the files bucketed by minute. Is this approach any better?
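From what I read, the bucketing only really happens when Hive itself writes the data, so I guess that instead of just pointing the partition at the raw folder I would need something like a staging table over the raw files plus an INSERT. A rough, untested sketch (the staging table name is just my placeholder):

-- Staging table that only points at the raw .json files for one hour
CREATE EXTERNAL TABLE LogsStaging (ex string, exTwo string, minutes int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'wasb://[email protected]/20160901/12';

set hive.enforce.bucketing = true;

-- Hive writes the 60 bucket files itself into the Logs partition
INSERT OVERWRITE TABLE Logs PARTITION (yearMonthDay = '20160901', hour = 12)
SELECT ex, exTwo, minutes FROM LogsStaging;

But that means Hive rewrites a copy of the data, so I am not sure the extra step is worth it.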

Is there another way of handling these tons of data that I am currently unaware of? (I am forced to use HDInsight.)

Thanks,

Regards
