Reputation: 15909
I am trying to export a large number of rows (160,000,000+) from InfluxDB to a CSV file. So far I am just blowing up the memory on the machine that runs the query, and I am lost on how to export this many rows without doing so. Any thoughts on this? I have also tried the CLI without any luck.
I have tried the following code:
def export_to_csv_file(self, file_name, header, query):
    logger.info("Executing query {}".format(query))
    # Builds a single DataFrame holding every returned point in memory
    dfs = pd.DataFrame(self.client.query(query, chunked=True, chunk_size=10000).get_points())
    dfs.to_csv('dummy.txt', index=False, columns=header, encoding='utf-8')
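For reference, a streaming variant that writes each chunk to the CSV as it arrives would avoid holding every row in memory at once. This is only a minimal Python 3 sketch, assuming the influxdb-python client yields one ResultSet per chunk when chunked=True (this depends on the client version):

import csv

def export_to_csv_file_streaming(self, file_name, header, query):
    # Hypothetical streaming variant: append each chunk to the CSV instead of
    # materializing one huge DataFrame (assumes query(..., chunked=True)
    # yields one ResultSet per chunk in this client version).
    logger.info("Executing query {}".format(query))
    with open(file_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=header, extrasaction='ignore')
        writer.writeheader()
        for chunk in self.client.query(query, chunked=True, chunk_size=10000):
            for point in chunk.get_points():
                writer.writerow(point)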
Any hints or tips on how I could export the data successfully?
Upvotes: 2
Views: 6559
Reputation: 1500
This can be done with the influx_inspect CLI tool plus some bash/grep/tr/cut postprocessing. It worked for me without memory problems when exporting >300M rows from InfluxDB v1.2.4.
The key was to use influx_inspect; commands like
influx -database 'metrics' -execute 'select * from cpu' -format 'csv'
failed miserably.
A script like this will create files with your data in the InfluxDB line protocol format:
#!/bin/bash
# Export one day at a time to keep each run small; adjust paths and DB name below.
month=2017-04
db=YOUR_DBNAME
rp=autogen
datadir=/data/influxdb/data
waldir=/data/influxdb/wal
outdir=/somepath/influx_export

# Days 01..10 of the month; extend the list to cover a full month.
for d in 0{1..9} 10 ; do
  echo $(date) Running time influx_inspect export -database $db -retention $rp -datadir $datadir -waldir $waldir -compress -start ${month}-${d}T00:00:00Z -end ${month}-${d}T23:59:59Z -out $outdir/export.${month}-${d}.lineproto.gz
  time influx_inspect export -database $db -retention $rp -datadir $datadir -waldir $waldir -compress -start ${month}-${d}T00:00:00Z -end ${month}-${d}T23:59:59Z -out $outdir/export.${month}-${d}.lineproto.gz
  echo $(date) Done
done
These lineproto files can then be converted to CSV with a postprocessing step.
In my case, the data lines in the output files looked like this:
# some header lines then data lines:
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=87 packets=1030000i 1488358500000000000
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=87 packets=2430000i 1488358800000000000
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=875 bytes=400000i 1488355200000000000
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=875 bytes=400000i 1488356400000000000
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=875 packets=10000i 1488355200000000000
The downside is that a measurement's data fields come in separate rows and in random order.
In my case the conversion script simply put each measurement data field (packets and bytes) into a separate CSV file (I joined them back together later in the database). You may need to customize it or write your own.
#!/bin/bash
MEASUREMENT=YOUR_MEASUREMENT_NAME

for file in *lineproto.gz ; do
  echo -e "--- $(date) Processing file $file ...."
  for field in packets bytes ; do
    # uncompress, strip the header lines, delete junk chars and the measurement name, replace spaces with commas
    gzip -dc ${file} | grep "${MEASUREMENT},device" | grep "$field" | tr -d a-zA-Z_=- | tr -s ' ' , | cut -b1 --complement >> field_${field}.csv
    echo -e "Conversion of field ${field} done"
  done
  echo -e "--- File $file processed at $(date)"
done
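Applied to the sample lines above, the packets pass of this pipeline would leave rows like the following in field_packets.csv (device, iface_in, iface_out, packets value, timestamp), since the grep/tr/cut chain strips the measurement name, the tag keys and the trailing "i":

10.99.0.6,998,87,1030000,1488358500000000000
10.99.0.6,998,87,2430000,1488358800000000000
10.99.0.6,998,875,10000,1488355200000000000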
Upvotes: 4