Reputation: 211
I have a rethinkdb table with 150 mn records, containing schema-less JSON data. I'm querying a nested field in the JSON, for example the 'Gate No.' field in the below JSON.
{ 'Name' : 'XYZ', 'Age' : 22, 'Address' : { 'Gate No.' : 7, 'Society' : 'ABC' } }
When I ran the same query for a table containing 1mn records, the query returned in 680 ms, however with 150 mn records, the query doesn't return at all. From the web console, it runs for a while and later gives an error : Query terminated by an unknown cause
. From my Java application, the query seems to run forever.
I've tried sharding with 4 servers, each holding ~37 mn documents but that doesn't seem to improve the situation. How can I get the query to run?
PS.: My JSON data is completely schema-less, so indexing the data is not a viable option.
Upvotes: 0
Views: 373
Reputation: 3672
based on the extra information you said, I'd tackle this problem like this:
First you should definitely make an index. You said your data is schemaless but I assume that all or most of the entries have an Address
and Gate No
field? If this is the case then you would create an index like so (in either the data explorer or translate this query to Java):
r.db("dbName").table("tableName").indexCreate('gate_no', r.row("Address")("Gate No.")
Without creating an index you're automatically resorting to doing a table scan every time you're searching for this document. You typically want to make indexes on queries you perform often. If you're constantly going to search Address
es by Gate No.
, this will help.
Now you use .getAll
to specify the value you want to find and explicitly pass the index you want to use which will return the value you're looking for. (Btw it's called .getAll
because unlike primary keys, secondary ones can have multiple values)
r.db("dbName").table("tableName").getAll(7, {index: 'gate_no'})
This will get you your result in a fraction of the time. (Not sure of the specifics but on average O(n log n) depending on the current balance of the index tree).
As for the Data Explorer, it's really a tool to do simple explorations of your data set and wasn't designed to parse millions of documents. Doing these kinds of tests I find dropping into ipython
or the node
repl makes things a lot easier and translatable to the end product.
Lastly, the Java driver shouldn't be behaving like that. You should definitely open up an issue on Github so we can properly invesitgate - https://github.com/rethinkdb/rethinkdb/issues/new.
Upvotes: 3