Question:

What would be the best way to store a very large amount of data for a web-based application?

Each record has just 3 fields, but there will be around 144 million records a day, stored for one month (31 days), which comes to 4,464,000,000 records in total. Let's round up to 5 billion.
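For scale, here's my back-of-envelope math in Python:

```python
# Back-of-envelope check of the volumes above.
records_per_day = 144_000_000
print(records_per_day * 31)      # 4,464,000,000 records retained over a month
print(records_per_day / 86_400)  # ~1,667 writes per second, sustained average
```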

The data has to be searchable by keyword, and results have to be returned to the end user as fast as possible.

  • Which programming language?
  • JSON / XML / Some Database System I've Never Heard Of?
  • What sort of infrastructure? Imagine this system only ever serves a maximum of 1,000 concurrent users.

I assume the code is the same whether you're searching 10 records or 10 billion; you just have to be a whole lot more efficient. I also assume MySQL/PHP doesn't stand a chance, and that we're going to be paying a very large sum for a hosting solution.

Just need some guidance on where to start, really. Thank you!

Answer:

There are many tools in the Big Data ecosystem (NoSQL databases, distributed computing, machine learning, search, etc.) that could form an answer to your question. At roughly 144 million records a day (about 1,700 writes per second on average), your application will be write-heavy, so I would advocate Apache Cassandra for its excellent write performance (although it requires more up-front data modeling than a document database such as MongoDB). You will also need a Solr- or Elasticsearch-based solution for the keyword search, and MapReduce for building indexes and running batch queries. A minimal sketch of that write-then-search split is below.
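To make that concrete, here is a minimal sketch in Python. It assumes your three fields are a timestamp, a keyword, and a value, that the `cassandra-driver` and `elasticsearch` (8.x client) packages are installed, and that both services run locally; the keyspace, table, and index names are all illustrative, not part of any standard setup:

```python
# Minimal sketch, not production code. Assumes `pip install cassandra-driver
# elasticsearch` and local Cassandra / Elasticsearch nodes; all names are illustrative.
from datetime import datetime, timezone

from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

cassandra = Cluster(["127.0.0.1"]).connect()
es = Elasticsearch("http://localhost:9200")

cassandra.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Partition by day so no single partition grows unbounded; at 144M rows/day you
# would bucket further (e.g. by hour) in practice. The 31-day TTL makes rows
# expire automatically, matching the "stored for one month" requirement.
cassandra.execute("""
    CREATE TABLE IF NOT EXISTS app.records (
        day     text,
        ts      timestamp,
        keyword text,
        value   text,
        PRIMARY KEY ((day), ts, keyword)
    ) WITH CLUSTERING ORDER BY (ts DESC, keyword ASC)
      AND default_time_to_live = 2678400
""")

now = datetime.now(timezone.utc)

# Write path: Cassandra is the system of record ...
cassandra.execute(
    "INSERT INTO app.records (day, ts, keyword, value) VALUES (%s, %s, %s, %s)",
    (now.strftime("%Y-%m-%d"), now, "example-keyword", "example-value"),
)

# ... and Elasticsearch carries the keyword index for fast user-facing search.
es.index(index="records", document={
    "ts": now.isoformat(),
    "keyword": "example-keyword",
    "value": "example-value",
})

# Read path: keyword search is served out of Elasticsearch. (Note: Elasticsearch
# is near-real-time, so a just-indexed document appears after ~1s refresh.)
hits = es.search(index="records", query={"match": {"keyword": "example-keyword"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"])
```

The reason for the split: Cassandra can only query efficiently along its partition and clustering keys, so arbitrary keyword lookups belong in the search engine; a MapReduce job would typically handle bulk (re)indexing and any periodic rollups.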

The programming language doesn't matter much unless you have business end-users who will be writing queries against your Big Data, in which case you can give them something very SQL-like such as Hive or Pig (a short Hive sketch follows the link below). To get you started, the following (recent) link might give you an idea of how to pick an analytics stack based on your needs; note that every database or distributed-computing paradigm specializes in some particular use case:

How we picked our analytics stack
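As an illustration of how SQL-like Hive feels, here is a minimal sketch using the PyHive client; it assumes `pip install pyhive` (plus its Thrift/SASL dependencies) and a reachable HiveServer2, and the host, table, and column names are all hypothetical:

```python
# Minimal sketch, assuming `pip install pyhive` and a reachable HiveServer2;
# the host, table, and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive.example.internal", port=10000, username="analyst")
cursor = conn.cursor()

# Hive compiles this SQL-like query down to MapReduce jobs behind the scenes,
# so business users never have to touch the distributed-computing layer.
cursor.execute("""
    SELECT keyword, COUNT(*) AS hits
    FROM records
    WHERE day >= '2012-11-01'
    GROUP BY keyword
    ORDER BY hits DESC
    LIMIT 20
""")

for keyword, hits in cursor.fetchall():
    print(keyword, hits)
```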

Also have a look at High Scalability for write-ups of how various companies tackle their scalability problems.
