Reputation: 183

Which NoSql to Use for storing billions of Integer pair data?

Right now I have table in Mysql with 3 columns.

DocId             Int
Match_DocId       Int
Percentage Match  Int

I am storing document id along with its near duplicate document id and percentage which indicate how closely two documents match.

So if one document has 100 near duplicates, we have 100 rows for that particular document.

Right now, this table has more than 1 billion records for total of 14 millions documents. I am expecting total documents to go upto 30 millions. That means my table which stores near duplicate information will have more than 5 billions rows, may be more than that. (Near duplicate data grows exponentially compare to total document set)

Here are few issues that I have:

Getting all there records in mysql table is taking lot of time.
Query takes lot of time as well.

Here are few queries that I run:

Check if particular document has any near duplicate. (this is relatively fast, but still slow)
Check for given set of documents, how many near duplicates are there in each percentage range (Percentage range is 86-90, 91-95 , 96-100)?

This query takes lot of time. Most of the time it fails. I am going group by on percentage column.

Can this be managed with any available NoSql solution?

I am skeptical for SQL query support for NoSql solutions as I need group by support while querying data.

Upvotes: 3

Answers (2)

oleksii

Reputation: 35905

MySQL

You can try sharding with your current MySql solution, i.e. splitting your large database into smaller distinctive databases. The problem with that is you should only work with one shard at a time and this would be fast. If you plan to use queries across several shards then it would be painfully slow.

NoSql

Apache Hadoop stack will be worth looking at. There are several systems that allow you to perform slightly different queries. A good point is that they all tend to interoperate well between each other.

Check if particular document has any near duplicate. (this is relatively fast, but still slow)

HBase can do this job for big table.

Check for given set of documents, how many near duplicates are there in each percentage range ? (Percentage range is 86-90, 91-95 , 96-100)

This should be a good fit for Map-Reduce

There are many other solutions, see this link for a list and brief description of other NoSql databases.

Upvotes: 2

favoretti

Reputation: 30167

We have good experiences with Redis. It's fast, can be make as reliable as you want it to. Other options could be CouchDB or Cassandra.

Upvotes: 1

Which NoSql to Use for storing billions of Integer pair data?

Answers (2)

MySQL

NoSql

Related Questions