I'm using mysql and would like to process a very large table with a primary key of 4 parts in blocks of 10,000 (marshalling data to another system). The database is offline when I am doing the processing so I don't have to worry about any modifications. Say the primary key is (`A`, `B`, `C`, `D`) which are all integers. I first tried using `LIMIT OFFSET` to achieve this like this:

``SELECT * FROM LargeTable ORDER BY (A, B, C, D) LIMIT 10000 OFFSET 0;``

Where I increased the offset by 10000 on each call. This seemed to get very slow as it got towards the higher rows in the table. Is it not possible to do this `LIMIT OFFSET` efficiently?

Then I tried a different approach that uses comparison on the composite primary key. I can get the first block like this:

``SELECT * FROM LargeTable ORDER BY (A, B, C, D) LIMIT 10000;``

If the last row of that block has `A = a`, `B = b`, `C = c`, and `D = d` then I can get the next block with:

``SELECT * FROM LargeTableWHEREA > a OR(A = a AND B > b) OR(A = a AND B = b AND C > c) OR(A = a AND B = b AND C = c AND D > d)ORDER BY (A, B, C, D) LIMIT 10000;``

And then repeat that for each block. This also seemed to slow down greatly as I got to the higher rows in the table. Is there a better way to do this? Am I missing something obvious?

Start processing data from the very start using just plain

``````SELECT *
FROM LargeTable
ORDER BY (A, B, C, D)
``````

and fetch rows one by one in your client code. You can fetch 10000 rows in your fetch loop if you want, or add `LIMIT 10000` clause. When you want to stop this block, remember last tuple (A, B, C, D) that was processed, lets call it `(A1, B1, C1, D1)`.

Now, when you want to restart from last point, fetch rows again one by one, but this time use tuple comparison in your WHERE clause:

``````SELECT *
FROM LargeTable
WHERE (A, B, C, D) > (A1, B1, C1, D1)
ORDER BY (A, B, C, D)
``````

(you can also add `LIMIT 10000` clause if you don't want to rely on client code exiting fetch loop prematurely). Key to this solution is that MySQL correctly implements tuple comparison.

EDIT: mentioned that optional `LIMIT 10000` can be added.

You're probably invoking a sequential scan of the table in some way.

Further, you're conditional SELECT is not doing what you think it does. It's short circuiting on the first condition A > a.

It'll be more efficient if you skip the ORDER BY and LIMIT and use a statement like:

``````SELECT *
FROM LargeTable
WHERE A = a AND B = b AND C = c;
``````

And just iterate through sets of a, b, and c.

A lot depends on the context in which you're doing your 'marshalling' operations, but is there a reason why you can't let the unconstrained SELECT run, and have your code do the grouping into blocks of 10,000 items?

In pseudo-code:

``````while (fetch_row succeeds)
{
add row to marshalled data
if (10,000 rows marshalled)
{
process 10,000 marshalled rows
set number of marshalled rows to 0
}
}
if (marshalled rows > 0)
{
process N marshalled rows
}
``````

Limit with offset needs to throw away rows until it finds the ones you actually want so it gets slow as you have a higher offset.

Here's an idea. Since your database is offline while you do this the data doesn't actually have to be present during the job. Why not move all processed rows to another table while processing them? I'm not sure it will be faster, it depends on how many indexes the table have but you should try it.

``````CREATE TABLE processed AS LargeTable;

SELECT * FROM LargeTable LIMIT 10000;
INSERT INTO processed SELECT * FROM LargeTable LIMIT 10000;
DELETE FROM LargeTable LIMIT 10000;

DELETE TABLE LargeTable;
RENAME TABLE processed TO LargeTable;
``````

Top