Problem description:

We have a time-series dataset in which each event has a timestamp and a set of key/value pairs. Each event has roughly the same keys, though the values vary. We're evaluating Bigtable as an option for storing this data, using a dataset with the following properties:

  • ~700 GB raw
  • ~600 million events
  • ~20 key/value pairs per event, no values over 128 bytes

We chose two different schemes to test:

  • One row per event, with each key stored as its own column within a "kvp" column family
  • One row per event, with all the key/value pairs encoded using msgpack and stored in a single column
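To make the two layouts concrete, here is a minimal sketch of what each scheme stores per row. It is a simplified model, not the code used in our tests: the `cell` type, the `columnPerKey` and `singleBlob` helpers, the sample event, and the `data` qualifier are all hypothetical, and JSON stands in for msgpack only so that the example needs nothing outside the Go standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// cell is a simplified stand-in for one stored Bigtable cell (hypothetical type).
type cell struct {
	Family    string
	Qualifier string
	Value     []byte
}

// columnPerKey models scheme 1: every key becomes its own column in "kvp".
func columnPerKey(event map[string]string) []cell {
	keys := make([]string, 0, len(event))
	for k := range event {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for the example
	cells := make([]cell, 0, len(keys))
	for _, k := range keys {
		cells = append(cells, cell{Family: "kvp", Qualifier: k, Value: []byte(event[k])})
	}
	return cells
}

// singleBlob models scheme 2: the whole event is serialized into one column.
// Assumption: JSON is used here only because it is in the standard library;
// the real test used msgpack. The qualifier "data" is a made-up name.
func singleBlob(event map[string]string) ([]cell, error) {
	blob, err := json.Marshal(event)
	if err != nil {
		return nil, err
	}
	return []cell{{Family: "kvp", Qualifier: "data", Value: blob}}, nil
}

func main() {
	// Hypothetical sample event; real events have ~20 key/value pairs.
	event := map[string]string{"host": "web-1", "status": "200", "path": "/index"}

	a := columnPerKey(event)
	b, err := singleBlob(event)
	if err != nil {
		panic(err)
	}
	fmt.Printf("scheme 1: %d cells per row\n", len(a))
	fmt.Printf("scheme 2: %d cell per row, %d bytes\n", len(b), len(b[0].Value))
}
```

With ~20 key/value pairs per event, the first scheme produces about 20 cells per row and the second produces one.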

What we found in testing was that msgpack'ing the data was vastly superior in almost every way:

  • Space used with msgpack was less than half that of non-msgpack (145 GB vs 350 GB)
  • Read speeds were much better, about 3M events/min vs 900k events/min
  • Read throughput was also better: we managed to max out the msgpack scheme at 240 Mb/s vs only 54 Mb/s

Write throughput was roughly the same between the two, however.
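For scale, the figures above can be turned into per-event averages. This is a rough sketch only: it assumes decimal gigabytes (1 GB = 1e9 bytes) and treats the ~600 million event count as exact; the helper names are made up for illustration.

```go
package main

import "fmt"

// bytesPerEvent converts a total size in GB (assumed decimal, 1 GB = 1e9 bytes)
// into an average number of bytes per event.
func bytesPerEvent(totalGB float64, events float64) float64 {
	return totalGB * 1e9 / events
}

// perSecond converts a per-minute rate into a per-second rate.
func perSecond(perMinute float64) float64 {
	return perMinute / 60
}

func main() {
	const events = 600e6 // ~600 million events, from the dataset description

	fmt.Printf("raw:         %.0f bytes/event\n", bytesPerEvent(700, events))
	fmt.Printf("column-keys: %.0f bytes/event\n", bytesPerEvent(350, events))
	fmt.Printf("msgpack:     %.0f bytes/event\n", bytesPerEvent(145, events))

	fmt.Printf("column-keys: %.0f events/s\n", perSecond(900e3))
	fmt.Printf("msgpack:     %.0f events/s\n", perSecond(3e6))
}
```

Under those assumptions this works out to roughly 1167 bytes per event raw, 583 with one column per key, and 242 with msgpack, and to read rates of about 15,000 vs 50,000 events per second.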

Here's the read code used for each test:

column-keys:

    err := tbl.ReadRows(ctx, bigtable.InfiniteRange(""), func(row bigtable.Row) bool {
        lm.Incr("submissions read")
        return true
    })

msgpack:

    family := "kvp"
    err := tbl.ReadRows(ctx, bigtable.InfiniteRange(""), func(row bigtable.Row) bool {
        lm.Incr("submissions read")
        unmarshal(row[family][0].Value)
        return true
    })

(lm is a metrics aggregator; it adds negligible lag.)

My question is this: are the discrepancies between the two schemes expected, or are we doing something wrong? A difference in read throughput isn't too surprising, but the size of the difference is. I wouldn't expect the amount of data transferred under each scheme to be that different.

Similarly, is the difference in storage size expected? My understanding is that Bigtable compresses data, so even the non-msgpack scheme should theoretically end up at a fairly similar size, especially since the column names are nearly constant across the dataset. I could be totally wrong about this, though.

Any light you can shed on this would be very helpful, and I'm happy to provide any further information to help debug.
