Question:

I am exporting a massive data set from Dynamics to Elasticsearch.

Below are the steps:

  1. Get the data from SQL (I am using Entity Framework). Let's call the main type contact.
  2. Group the data into batches of a defined size and serialize them.
  3. Format the data for bulk upload as per the ES docs.
  4. Call HttpPost and send the data to the ES endpoint.
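
For readers who want to see steps 2 and 3 concretely, here is a minimal sketch in Python (not the poster's actual Entity Framework code). It assumes each contact is a dict with an `id` field, and the index name `contacts` and the helper names `chunked` and `to_bulk_body` are made up for illustration. The bulk format is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. Older versions such as ES 1.x also expect a `_type` in the action line, which is left out here.

```python
import json
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def to_bulk_body(docs, index_name):
    """Build a bulk request body: an action line, then a source line, per document."""
    lines = []
    for doc in docs:
        # Action line: tells ES to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
    # The bulk endpoint requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Assumed sample data standing in for the contacts loaded from SQL.
contacts = [{"id": i, "name": f"contact-{i}"} for i in range(5)]
bodies = [to_bulk_body(batch, "contacts") for batch in chunked(contacts, 2)]
```

Each string in `bodies` would then be the payload of one POST to the `_bulk` endpoint.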

I am doing extensive logging for the time it takes and any errors.

It all works, and the export completes in an hour.

That said, I have observed that the HttpPost's response time keeps increasing. I have looked for memory leaks and for anything I should be disposing, and haven't found any. I want to make sure it will not haunt me later.

So, what are the possible reasons for the increasing response times?

How should I go about investigating the issue?

Answer:

I use ES 1.7 and index about 10 million documents using a similar scenario. From my experience, if you push ES too hard it will slow down and sometimes fail with OutOfMemory exceptions. I don't know whether this is still an issue with newer versions.

IMHO it is because ES needs some time to process bulks: it accepts the data and indexes it, but after that it does some background work to optimize the index.

To overcome the issue I experimented with these parameters: the size of a single bulk (N), the sleep time between bulks (S1), and a much longer sleep (S2) after every few (M) bulks. For my dataset and my hardware I ended up with N=5000, S1=1s, M=10, S2=10s. To choose safe values I watch CPU, memory and I/O usage. For example, increased I/O usage over an extended period may suggest that ES will break soon.
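
A rough sketch of that throttling loop in Python, under a few assumptions: the function name `index_with_pauses` and its parameter names are invented here, `send` stands for whatever posts one bulk to ES, and replacing the short sleep with the long one after every M-th bulk is just one reading of the scheme above. The `sleep` argument is injectable so the loop can be tried out without really waiting.

```python
import time


def index_with_pauses(bulks, send, m=10, s1=1.0, s2=10.0, sleep=time.sleep):
    """Send bulks one by one, sleeping s1 after each and s2 after every m-th bulk."""
    durations = []
    for count, bulk in enumerate(bulks, start=1):
        started = time.monotonic()
        send(bulk)  # hypothetical callable that posts one bulk to ES
        # Record how long the call took so a slowdown can be spotted in the logs.
        durations.append(time.monotonic() - started)
        if count % m == 0:
            sleep(s2)  # longer pause so ES can catch up on background work
        else:
            sleep(s1)  # short pause between ordinary bulks
    return durations


# Dry run with stand-ins: "bulks" are just numbers, sending and sleeping only record calls.
sent = []
pauses = []
durations = index_with_pauses(range(25), sent.append, sleep=pauses.append)
```

In the dry run, `pauses` ends up holding the sleep lengths that would have been used, with the long pause appearing after the 10th and 20th bulk. The values for N, S1, M and S2 still have to be tuned for your own data and hardware.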

I'm sure it depends heavily on the hardware you have. In particular, give ES as much memory as you can!
