Reputation: 427
I have a large table on a remote server with an unknown number of rows (in the millions). I'd like to fetch the data in batches of 100,000 rows at a time, upsert each fetched batch into my local table, and repeat until all rows have been fetched. Is there a way I can update a local table remotely?
Currently I have a dummy table called t on the server, along with the following variables...
t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)
selector:select from t where sym like "A*"
counter:count selector
divy:counter%100000
divyUP:ceiling divy
and the function below on the client, along with the variable index set to 0 and normTable, an empty copy of the remote table's schema...
index:0
normTable:h"0#t"
batches:{[idx;divy;anty;seltr]
  if[not idx=divy;
    batch:select[(anty;100000)] from seltr;
    `normTable upsert batch;
    idx+::1;
    divy:divy;
    anty+:100000;
    seltr:seltr;
    batches[idx;divy;anty;seltr]];
  idx::0}
I call that function using the following command...
batches[index;h"divyUP";0;h"selector"]
The problem with this approach, though, is that h"selector" fetches all the rows of data at the same time (and multiple times: once for each batch of 100,000 that it upserts to my local normTable).
I could move the batches function to the remote server, but then how would I update my local normTable remotely?
Alternatively, I could break up the rows into batches on the server and then pull each batch individually. But if I don't know how many rows there are, how do I know how many variables are required? For example, the following would work, but only up to the first 400k rows...
batch1:select[100000] from t where sym like "A*"
batch2:select[100000 100000] from t where sym like "A*"
batch3:select[200000 100000] from t where sym like "A*"
batch4:select[300000 100000] from t where sym like "A*"
Is there a way to set a batchX variable so that a new variable is created for each batch, up to the count in divyUP?
Upvotes: 1
Views: 259
Reputation: 376
Rather than having individual variables, the cut function can split the result of the select into chunks of 100,000 rows. Each element of the result is itself a table that can be indexed into.
batches:100000 cut select from t where sym like "A*"
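For instance, a minimal sketch of pulling those chunks one at a time over the question's handle h (assuming batches is defined on the server; nb is a hypothetical name introduced here for illustration):
/ on the server: number of 100,000-row chunks
nb:count batches
/ on the client: pull each chunk by index and upsert it locally
{[h;j]`normTable upsert h(`batches;j)}[h]each til h"nb"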
Upvotes: 1
Reputation: 13572
If you're OK with sending sync messages, it can be simplified to something like:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"
And you can easily change it to control the number of batches (rather than their size) by instead using 10 0N# (that would do it in 10 batches).
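For instance, the 10-batch variant of the same one-liner would look like this (a sketch, assuming the same h and mytab):
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 10 0N#til h"count t"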
Upvotes: 1
Reputation: 3969
I would suggest a few changes, since you are connecting to a remote server:
The approach below builds on the suggestions above. It avoids scanning the full table for columns other than the index column (which is lightweight) and makes the next request only after the last batch has arrived.
Create a batch processing function
This function runs on the server; it reads a small batch of rows from the table using indices and returns the required data.
q) batch:{[ind;s] ni:ind+s; d:select from t where i within (ind;ni), sym like "A*";
neg[.z.w](`upd;d;$[ni<count t;ni+1;0]) }
It takes 2 arguments: the starting index and the batch size to work on.
This function finally calls the upd function on the local machine asynchronously, passing it 2 arguments.
Create a callback function
The result from the batch processing function comes into this function. If the index is greater than 0, there is more data to process and the next batch should start from this index.
q) upd:{[data;ind] t::t,data;if[ind>0;fetch ind]}
Create a main function to start the process
q) fetch:{[ind] h (`batch;ind;size)}
Finally, open the connection, create the table variable, and run the fetch function.
q) h:hopen `:server:port
q) t:()
q) size:100
q) fetch 0
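Once the chain completes, the local t holds all matching rows. A quick sanity check might look like this (a sketch, assuming the handle h is still open):
q) count t                                       / rows accumulated locally
q) h"count select from t where sym like \"A*\""  / matching rows on the server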
Note that the above method assumes the server table is static. If it is being updated in real time, changes would be required, depending on how the table is being updated on the server.
Also, other optimizations may be possible depending on the attributes set on the remote table, which could improve performance.
Upvotes: 3