Question:

Hi

I can't find anything about Berkeley DB's bulk insert feature written in C. I can find bulk update, select, and delete at http://download.oracle.com/docs/cd/E17076_02/html/programmer_reference/am_misc_bulk.html. Can anybody tell me how to write a bulk insert? I'm new to both C and Berkeley DB.

  • I also want to write quite a lot of data (maybe 30 GB) using this feature, so please advise me on performance too.
  • My boss wants me to use the Hash access method.

Thanks

Kevin

Answer:

I don't know if this is going to help or hurt, given that you're new to both C and Berkeley DB.

You would need to use the DB_MULTIPLE flag with DB->put().

To do this, you need to create one bulk DBT structure for your keys and one for your data. Each buffer must be large enough to hold the entire batch of keys or values. You then initialize both of them with DB_MULTIPLE_WRITE_INIT, and append your keys and values to the respective buffers with DB_MULTIPLE_WRITE_NEXT.

This was added in 4.8, and honestly, I can't find a concrete example for you via Google.

EDIT: At least in the latest releases, example code for bulk operations ships with Berkeley DB. Take a look at examples/c/ex_bulk.c.

Answer:

You can try grouping your inserts into one or more transactions. For example: start a transaction, do the inserts, then commit. That's a standard way to speed up database changes, because it reduces the per-operation transaction overhead of committing each statement independently.

I'm not familiar with the Berkeley DB API, so it might have something better suited for bulk operations; I'm just offering general advice.

Edit:
Some links regarding transactions:
1. Wikipedia entry
2. Berkeley DB Transaction Throughput

Answer:

For the sake of C++ users, here's how to do it using the Berkeley DB C++ API, which is both undocumented and has zero examples. It does work pretty well, though!

Create a Dbt (a "database Thang"; I'm not making that up) to hold a memory buffer:

void* buf = new unsigned char[bufferSize];
Dbt* dbt = new Dbt;
dbt->set_data(buf);
dbt->set_ulen(bufferSize);
dbt->set_flags(DB_DBT_USERMEM);

Associate that with a DBMultipleKeyDataBuilder:

DBMultipleKeyDataBuilder* dbi=new DBMultipleKeyDataBuilder(dbt);

Append your key and value pairs one at a time until you're done or the buffer is full:

dbi->append(curKeyBuf, curKeyLen, curDataBuf, curDataLen);
...(lots more of these)...

Use your Db* db, and a transaction in txn if you wish, and bulk write:

db->put(txn, dbt, NULL, DB_MULTIPLE_KEY);

delete dbi;

I've left out lots of detail, such as checking whether the buffer is full, or whether it's big enough to hold even one key/value pair.

A DBMultipleKeyDataBuilder can only be used once, but a really efficient implementation will keep a pool of buffer Dbt objects and reuse them. You can use these Dbts for bulk reading as well, so a common pool of them can be used.

Answer:

The Berkeley DB forums are monitored by several Berkeley DB developers. That would be another good place to post such questions.

Answer:

Bulk loading a hash in Berkeley DB has been a problem in the past. The following paper explores this further and suggests an algorithm to speed it up. The suggested algorithm sorts the data in the order that linear hashing (as used in Berkeley DB) expects, so loading can be done in a single scan of the sorted data. This scales very well for large datasets.

Davood Rafiei, Cheng Hu, "Bulk Loading a Linear Hash File", Proc. of the DaWaK Conference, 2006. https://webdocs.cs.ualberta.ca/~drafiei/papers/dawak06.pdf
