Problem description:

I have MongoDB set up on an Amazon EC2 micro instance. There are about 7 million items in the DB. I'm trying to iterate over all of them and print out some information about each item. I'm using the Python driver (pymongo) to do so.

    import pymongo as p

    db_client = p.MongoClient()  # connects to localhost:27017 by default
    db = db_client.my_awesome_db
    photo_collection = db.photos

    # Iterate over every document and print one field.
    for photo in photo_collection.find():
        print(photo['attr'])

I'm not storing anything in memory and the DB isn't being used by anything else.

Since the query was running long, I used limit() to estimate how long it should take. The times grow faster than linearly as I increase the limit. For example:

  • limit -> time
  • 1,000 -> 1 second
  • 10,000 -> 10 seconds
  • 100,000 -> 720 seconds (~ 12 minutes)
  • 700,000 -> 9000 seconds
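For reference, timings like the ones above could be collected with a small harness along these lines. This is only a sketch: `time_iteration` and `fake_cursor` are hypothetical names introduced here, and the stand-in generator replaces the real cursor so the snippet runs without a database. Against the real collection one would pass `lambda n: photo_collection.find().limit(n)` instead.

```python
import time

def time_iteration(make_iterable, limits):
    """Return {limit: seconds} for fully iterating make_iterable(limit)."""
    timings = {}
    for n in limits:
        start = time.perf_counter()
        for _ in make_iterable(n):
            pass  # touch every item, do no other work
        timings[n] = time.perf_counter() - start
    return timings

# Hypothetical stand-in for a cursor, so the sketch runs without MongoDB.
# With pymongo, pass: lambda n: photo_collection.find().limit(n)
def fake_cursor(n):
    return ({"attr": i} for i in range(n))

timings = time_iteration(fake_cursor, [1000, 10000])
for n, seconds in sorted(timings.items()):
    print("%d -> %.4f s" % (n, seconds))
```

Timing the bare loop without the print call would also show how much of the cost is console output rather than the query itself.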

This isn't ridiculous, but it's worse than linear (the jump from 10k to 100k seems pretty bad). I can easily iterate over a 7-million-line file in a second, but at this rate it will take 25 hours to iterate over the whole DB.
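To make the arithmetic behind that estimate explicit, here is a short sketch that turns each measurement into a per-item cost and a projected time for the full collection. The numbers are the ones listed above, and the total of 7 million items is the approximate count mentioned earlier. The projection assumes the per-item cost stays at the measured rate, which may well not hold.

```python
# (limit, seconds) pairs from the measurements listed above.
measurements = [(1000, 1), (10000, 10), (100000, 720), (700000, 9000)]
TOTAL_ITEMS = 7000000  # approximate size of the collection

projected_hours = {}
for limit, seconds in measurements:
    ms_per_item = 1000.0 * seconds / limit
    # Extrapolate linearly from this measurement to the whole collection.
    hours = seconds * TOTAL_ITEMS / float(limit) / 3600.0
    projected_hours[limit] = hours
    print("limit=%d: %.2f ms/item, projected %.1f hours" % (limit, ms_per_item, hours))
```

The per-item cost goes from 1 ms at the two smallest limits to about 12.9 ms at 700,000, and that last rate is where the 25-hour figure comes from.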

Do I have something configured wrong? Is find() not the correct function to use?
