Kapil Earanky

Reputation: 211

Improve query performance for large number of records in rethinkdb

I have a RethinkDB table with 150 mn records containing schema-less JSON data. I'm querying a nested field in the JSON, for example the 'Gate No.' field in the JSON below.

{ "Name": "XYZ", "Age": 22, "Address": { "Gate No.": 7, "Society": "ABC" } }

When I ran the same query on a table containing 1 mn records, it returned in 680 ms; with 150 mn records, however, the query doesn't return at all. From the web console, it runs for a while and then gives an error: Query terminated by an unknown cause. From my Java application, the query seems to run forever.

I've tried sharding with 4 servers, each holding ~37 mn documents but that doesn't seem to improve the situation. How can I get the query to run?

PS.: My JSON data is completely schema-less, so indexing the data is not a viable option.

Upvotes: 0

Views: 373

Answers (1)

dalanmiller

Reputation: 3672

Based on the extra information you've given, I'd tackle this problem like this:

First, you should definitely create an index. You said your data is schema-less, but I assume all or most of the entries have an Address and Gate No. field? If so, you can create an index like this (in the Data Explorer, or translate this query to Java):

r.db("dbName").table("tableName").indexCreate('gate_no', r.row("Address")("Gate No."))

Without an index, every such query falls back to a full table scan. You typically want to create indexes for the queries you run often, so if you're constantly going to look up addresses by Gate No., this will help.
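To see why this matters, here's a plain-Python illustration (nothing RethinkDB-specific; the data is made up) of the difference between scanning every document and consulting a prebuilt lookup structure:

```python
# Illustrative: why an index beats a table scan.
docs = [
    {"Name": "XYZ", "Age": 22, "Address": {"Gate No.": 7, "Society": "ABC"}},
    {"Name": "PQR", "Age": 30, "Address": {"Gate No.": 3, "Society": "DEF"}},
]

# Table scan: touch every document, on every query (O(n) per query).
scan_hits = [d for d in docs if d["Address"]["Gate No."] == 7]

# "Index": build the lookup structure once; each query afterwards is a
# cheap lookup instead of a scan.
index = {}
for d in docs:
    index.setdefault(d["Address"]["Gate No."], []).append(d)
index_hits = index.get(7, [])

assert scan_hits == index_hits
```

At 150 mn documents, that per-query scan is exactly what's timing out.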

Now use .getAll, passing the value you want to find and explicitly naming the index to use; this returns the matching documents. (It's called .getAll because, unlike primary keys, a secondary index value need not be unique, so multiple documents can match.)

r.db("dbName").table("tableName").getAll(7, {index: 'gate_no'})

This will get you your result in a fraction of the time: an index lookup is roughly O(log n), depending on the current balance of the index tree, rather than the O(n) of a full table scan.

As for the Data Explorer, it's really a tool for simple explorations of your data set and wasn't designed to handle millions of documents. For these kinds of tests, I find dropping into ipython or the Node REPL makes things a lot easier and more translatable to the end product.
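For completeness, here's roughly what the whole flow looks like from ipython with the official Python driver (a sketch only: it assumes a server on localhost and placeholder db/table names):

```python
import rethinkdb as r

conn = r.connect("localhost", 28015)

# Create the index once, then wait for it to finish building
# (building an index over 150 mn documents takes a while):
r.db("dbName").table("tableName").index_create(
    "gate_no", r.row["Address"]["Gate No."]).run(conn)
r.db("dbName").table("tableName").index_wait("gate_no").run(conn)

# The indexed query:
cursor = r.db("dbName").table("tableName").get_all(7, index="gate_no").run(conn)
for doc in cursor:
    print(doc)
```

Note the index_wait: .getAll will error if you query an index that's still being constructed.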

Lastly, the Java driver shouldn't be behaving like that. You should definitely open an issue on GitHub so we can properly investigate - https://github.com/rethinkdb/rethinkdb/issues/new.

Upvotes: 3
