How to actually use merge sort for large data sets?

Suppose that I have several sorted files with the following data:

1.txt

``122``

2.txt

``345``

3.txt

``111``

Suppose that we can't hold all files' contents in memory at the same time (let's say we can hold only two numbers from each file).

I heard that I can use some kind of R-way merge sort in this case but I don't understand how can I actually do it.

As you see, the first iteration will give us the following sorted sequence:

``1 1 1 2 3 4``

, so we flush it to the output file. However, we will get `1` again (from the `3.txt` file) on the next iteration, so the whole resulting sequence is wrong!

Start by filling as many variables as you have files, one variable attached to one file. At each step find the lowest value of the three variables, and flush it to the output while filling it again from the same file.

``````| 1.txt | 2.txt | 3.txt |
| 1     | 3     | 1     | output 1 refill from file 1
| 2     | 3     | 1     | output 1 refill from file 3
| 2     | 3     | 1     | output 1 refill from file 3
| 2     | 3     | 1     | output 1 refill from file 3
| 2     | 3     | nil   | output 2 refill from file 1
| 2     | 3     | nil   | output 2 refill from file 1
| nil   | 3     | nil   | output 3 refill from file 2
| nil   | 4     | nil   | output 4 refill from file 2
| nil   | 5     | nil   | output 5 refill from file 2
| nil   | nil   | nil   | end
``````

`I heard that I can use some kind of R-way merge sort in this case but I don't understand how can I actually do it.`

N-way merges are quite easy to explain. You open all the files, get the first element from each and put them into a heap. The algorithm then proceeds by getting the smallest element from the heap (pop), writing it to your output buffer and then read the next element from the file this item originated from. Repeat until all files are empty.

Top