Reputation: 6276
I'm trying to export a large set of Elasticsearch query results to CSV with pandas. I'm talking about a minimum of 1 million records that need to be exported. Execution takes far too long, and when I check the logs I see that pandas spends most of that time appending the data.
To achieve this I'm using the scan helper function with a chunk size of 5000; with it I'm able to pull all of the Elasticsearch data via scrolling.
My code looks like:
import pandas
from elasticsearch.helpers import scan

scan_docs = pandas.DataFrame()
for hit in scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000):
    scan_source_data = hit["_source"]
    scan_id = hit["_id"]
    scan_doc_data = pandas.Series(scan_source_data, name=scan_id)
    # append returns a new, larger DataFrame on every call
    scan_docs = scan_docs.append(scan_doc_data)
scan_docs.to_csv("/tmp/scandocs.csv", ",")
When I checked the logs:
[2021-08-26 08:10:51,876: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/server-logs/_search?scroll=20h&size=5000 [status:200 request:0.814s]
[2021-08-26 08:13:39,213: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.653s]
[2021-08-26 08:17:50,795: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.813s]
[2021-08-26 08:23:15,433: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.554s]
[2021-08-26 08:30:25,491: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.585s]
[2021-08-26 08:38:42,866: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.483s]
As you can see, the first chunk of 5000 took about 3 minutes, the next took 4 minutes, the next 6 minutes, then 8 minutes, and the time keeps increasing as the DataFrame being appended to grows.
I tried exporting around 50,000 records in one go and it took about an hour and a half to produce the CSV. When I tried to export 200,000 records, the logs showed a gap of an hour between two consecutive scroll queries, and the job was still running after more than 24 hours.
I'm a newbie to the Python world and don't know much about it, but I'm sure there must be a better way.
I'm unable to find an optimised way to resolve this. Any info on this is appreciated. Thanks.
Upvotes: 0
Views: 1227
Reputation: 4548
Edit: adding timing print statements to see whether the scan call is slow or the for loop is slow. If it is the scan, then you might be out of luck.
import time

with open('/tmp/scandocs.csv', 'w') as f_out:
    scan_start_time = time.time()
    hits = scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000)
    scan_duration = time.time() - scan_start_time
    print(scan_duration)

    loop_start_time = time.time()
    for hit in hits:
        scan_source_data = hit["_source"]
        scan_id = hit["_id"]
        # document id first, then the field values joined by commas
        output_line = '{},'.format(scan_id)
        output_line += ','.join(str(value) for value in scan_source_data.values())
        f_out.write(output_line + '\n')
    loop_duration = time.time() - loop_start_time
    print(loop_duration)
Before edit:
Appending in pandas is slow (Improve Row Append Performance On Pandas DataFrames). Do you need to have all the scan_doc_data
in memory at once? Or can you write it out as soon as you retrieve it? If you can write out immediately, I'd suggest writing directly to an open file handle and creating the CSV yourself line by line:
with open('/tmp/scandocs.csv', 'w') as f_out:
    for hit in scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000):
        scan_source_data = hit["_source"]
        scan_id = hit["_id"]
        # document id first, then the field values joined by commas
        output_line = '{},'.format(scan_id)
        output_line += ','.join(str(value) for value in scan_source_data.values())
        f_out.write(output_line + '\n')
I haven't tested this, so there may be errors.
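One caveat with joining values by commas yourself: if any field value contains a comma or a newline, the resulting file won't be valid CSV. Here is an untested sketch of the same line-by-line idea using Python's built-in csv module, which handles quoting for you; the fieldnames list below is just a placeholder and you'd replace it with the actual fields in your documents:

import csv

from elasticsearch.helpers import scan

# Placeholder column list: substitute the real fields from your index
fieldnames = ["_id", "timestamp", "level", "message"]

with open('/tmp/scandocs.csv', 'w', newline='') as f_out:
    # extrasaction='ignore' skips any _source keys not listed in fieldnames
    writer = csv.DictWriter(f_out, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for hit in scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000):
        row = {"_id": hit["_id"], **hit["_source"]}
        # DictWriter quotes values containing commas or newlines, so the CSV stays valid
        writer.writerow(row)

And if you really do need a DataFrame at the end for further processing, collecting the hits in a plain Python list and building the DataFrame once with pandas.DataFrame.from_records avoids the quadratic cost of appending inside the loop.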
Upvotes: 1