Nitish Kumar

Reputation: 6276

Optimising export of millions of Elasticsearch query results to CSV with pandas

I'm trying to export a large set of Elasticsearch query results to CSV with pandas; at minimum, 1 million records need to be exported. Execution takes far too long, and when checking the logs I can see that pandas spends most of the time appending the data.

To achieve this I'm using the scan helper function with a batch size of 5000, which lets me extract all the Elasticsearch data via infinite scroll.

My code looks like:

import pandas
from elasticsearch.helpers import scan

scan_docs = pandas.DataFrame()

for hit in scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000):
    scan_source_data = hit["_source"]
    scan_id = hit["_id"]
    scan_doc_data = pandas.Series(scan_source_data, name=scan_id)
    scan_docs = scan_docs.append(scan_doc_data)

scan_docs.to_csv("/tmp/scandocs.csv", sep=",")

When I checked the logs:

[2021-08-26 08:10:51,876: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/server-logs/_search?scroll=20h&size=5000 [status:200 request:0.814s]
[2021-08-26 08:13:39,213: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.653s]
[2021-08-26 08:17:50,795: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.813s]
[2021-08-26 08:23:15,433: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.554s]
[2021-08-26 08:30:25,491: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.585s]
[2021-08-26 08:38:42,866: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.483s]

As you can see, the first 5000-record chunk took 3 minutes, the next took 4 minutes, then 6 minutes, then 8 minutes; the delay keeps growing as the appended DataFrame gets larger.

I tried exporting around 50,000 records in one go and it took around an hour and a half to produce the CSV. When trying to export 200,000 records, the logs showed a gap of an hour between two scroll queries, and the job had been running for more than 24 hours.

I'm a newbie in the Python world and don't know much about it, but I'm sure there must be a better way.

I'm unable to find an optimised way to solve this; any pointers are appreciated. Thanks.

Upvotes: 0

Views: 1227

Answers (1)

mitoRibo

Reputation: 4548

Edit: adding timing print statements to see whether the scan call or the for loop is slow. If it's the scan, then you might be out of luck.

import time

with open('/tmp/scandocs.csv', 'w') as f_out:
    scan_start_time = time.time()
    # Note: scan() returns a lazy generator, so most of the network time
    # will show up in the loop timing below rather than here.
    hits = scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000)
    scan_duration = time.time() - scan_start_time
    print(scan_duration)

    loop_start_time = time.time()
    for hit in hits:
        scan_source_data = hit["_source"]
        scan_id = hit["_id"]
        output_line = '{},'.format(scan_id)
        # join the field values; iterating the dict directly would join its keys
        output_line += ','.join(str(v) for v in scan_source_data.values())
        f_out.write(output_line + '\n')

loop_duration = time.time() - loop_start_time
print(loop_duration)

Before edit:

Appending in pandas is slow (see Improve Row Append Performance On Pandas DataFrames). Do you need all the scan_doc_data in memory at once, or can you write each record out as soon as you retrieve it? If you can write immediately, I'd suggest writing directly to an open file handle and building the CSV yourself line by line:

with open('/tmp/scandocs.csv', 'w') as f_out:
    for hit in scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000):
        scan_source_data = hit["_source"]
        scan_id = hit["_id"]
        output_line = '{},'.format(scan_id)
        # join the field values; iterating the dict directly would join its keys
        output_line += ','.join(str(v) for v in scan_source_data.values())
        f_out.write(output_line + '\n')

I haven't tested this, so there may be errors.
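One caveat with the hand-rolled join above: if any field value itself contains a comma or newline, the output file will be a malformed CSV. The standard-library csv module handles quoting for you while keeping the same streaming, constant-memory approach. A sketch, not tested against a live cluster; `stream_hits_to_csv` is a hypothetical helper name, and it assumes each hit is a dict shaped like an Elasticsearch scan result with the same `_source` keys throughout:

```python
import csv

def stream_hits_to_csv(hits, path):
    """Write Elasticsearch hits to a CSV one row at a time.

    csv.DictWriter quotes fields that contain commas or newlines, and
    nothing is accumulated in memory beyond the current hit.
    """
    writer = None
    with open(path, 'w', newline='') as f_out:
        for hit in hits:
            row = {'_id': hit['_id'], **hit['_source']}
            if writer is None:
                # Take the column header from the first hit's fields.
                writer = csv.DictWriter(f_out, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
```

You'd call it as `stream_hits_to_csv(scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000), '/tmp/scandocs.csv')`, passing the scan generator straight through so hits are written as they arrive.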

Upvotes: 1
