Reputation: 183
Right now I have table in Mysql with 3 columns.
DocId Int
Match_DocId Int
Percentage Match Int
I am storing document id along with its near duplicate document id and percentage which indicate how closely two documents match.
So if one document has 100 near duplicates, we have 100 rows for that particular document.
Right now, this table has more than 1 billion records for total of 14 millions documents. I am expecting total documents to go upto 30 millions. That means my table which stores near duplicate information will have more than 5 billions rows, may be more than that. (Near duplicate data grows exponentially compare to total document set)
Here are few issues that I have:
Here are few queries that I run:
Check if particular document has any near duplicate. (this is relatively fast, but still slow)
Check for given set of documents, how many near duplicates are there in each percentage range (Percentage range is 86-90, 91-95 , 96-100)?
This query takes lot of time. Most of the time it fails. I am going group by on percentage column.
Can this be managed with any available NoSql solution?
I am skeptical for SQL query support for NoSql solutions as I need group by support while querying data.
Upvotes: 3
Views: 417
Reputation: 35905
You can try sharding with your current MySql solution, i.e. splitting your large database into smaller distinctive databases. The problem with that is you should only work with one shard at a time and this would be fast. If you plan to use queries across several shards then it would be painfully slow.
Apache Hadoop stack will be worth looking at. There are several systems that allow you to perform slightly different queries. A good point is that they all tend to interoperate well between each other.
Check if particular document has any near duplicate. (this is relatively fast, but still slow)
HBase can do this job for big table.
Check for given set of documents, how many near duplicates are there in each percentage range ? (Percentage range is 86-90, 91-95 , 96-100)
This should be a good fit for Map-Reduce
There are many other solutions, see this link for a list and brief description of other NoSql databases.
Upvotes: 2