Reroute

Reputation: 273

Parsing large CSV files quickly

I've been getting the hang of Python for the past week and finally got something working, but I could use some help speeding it up.

The function takes a .csv log off a vehicle CAN bus and reduces it to a list of records that match a set of message IDs, plus a set of which message IDs were encountered.

The files range from 500,000 to 50,000,000 lines. Currently it takes about 3.2 µs per line on my laptop.

The CSV file lines look like:

Time [s],Packet,Type,Identifier,Control,Data,CRC,ACK
0.210436250000000,0,DATA,0x0CFAE621,0x8,0x02 0x50 0x00 0x00 0x04 0x01 0x00 0x29,0x19A8,NAK
...
...
52.936353750000002,15810,DATA,0x18FC07F4,0x8,0xF0 0x09 0x00 0x00 0xCE 0x03 0x92 0x20,0x0C47,ACK

So the 4th field, "0x0CFAE621", is the message ID, and the 6th field, "0xF0 0x09 0x00 0x00 0xCE 0x03 0x92 0x20", is the data.

The ID is masked with 0x00FFFF00, and if it matches, the record is saved as [0xFAE600,'F0','09','00','00','CE','03','92','20']. Ideally I would like to convert all that data to int at this point, but wrapping each value with int() seemed very slow. (I was thinking I might be able to improve it with a dict to do the hex-to-int conversion, but I was not sure how to do that.)
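A minimal sketch of that dict idea, assuming each byte arrives as a stripped, uppercase two-digit hex string (the table and function names here are made up for illustration):

```python
# Precompute a lookup table mapping every two-digit uppercase hex string
# ('00'..'FF') to its integer value, so the hot loop does a dict lookup
# instead of calling int(x, 16) once per byte.
HEX_TO_INT = {format(i, '02X'): i for i in range(256)}

def convert_bytes(hex_bytes):
    """Convert a list of two-digit hex strings like ['F0', '09'] to ints."""
    return [HEX_TO_INT[b] for b in hex_bytes]
```

Note that the output of `split("0x")` carries trailing spaces, so the byte strings would need stripping (or the table would need keys with the space included) before this lookup works.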

The len() and the chain of if statements exist because a message can carry anywhere from zero to eight data bytes; again, I feel there is probably a better way to accomplish this.

from tkinter import filedialog
from tkinter import Tk
import timeit

Tk().withdraw()
filename = filedialog.askopenfile(title="Select .csv log file", filetypes=(("CSV files", "*.csv"), ("all files", "*.*")))

if not filename:
    print("No File Selected")
else:
    CanIdentifiers = set()
    CanRecordData = []
    IdentifierList = {0x00F00100,0x00F00400,0x00FC0800,0x00FE4000,0x00FE4E00,0x00FE5A00,0x00FE6E00,0x00FEC100,0x00FEC300,0x00FECA00,0x00FEF100}
    mask = 0x00FFFF00
    loopcount = 0
    error = 0
    csvtype = 0

    start_time = timeit.default_timer()

    for line in filename.readlines():
        message = line.split(',')

        if csvtype == 1:
            if message[2] == "DATA":
                messageidentifier = int(message[3], 16) & mask
                if messageidentifier not in CanIdentifiers:
                    CanIdentifiers.add(messageidentifier)
                if messageidentifier in IdentifierList:
                    messagedata = message[5].split("0x")
                    size1 = len(messagedata)
                    if size1 == 2:
                        CanRecordData.append((messageidentifier, messagedata[1]))
                    if size1 == 3:
                        CanRecordData.append((messageidentifier, messagedata[1], messagedata[2]))
                    if size1 == 4:
                        CanRecordData.append((messageidentifier, messagedata[1], messagedata[2], messagedata[3]))
                    if size1 == 5:
                        CanRecordData.append((messageidentifier, messagedata[1], messagedata[2], messagedata[3], messagedata[4]))
                    if size1 == 6:
                        CanRecordData.append((messageidentifier, messagedata[1], messagedata[2], messagedata[3], messagedata[4], messagedata[5]))
                    if size1 == 7:
                        CanRecordData.append((messageidentifier, messagedata[1], messagedata[2], messagedata[3], messagedata[4], messagedata[5], messagedata[6]))
                    if size1 == 8:
                        CanRecordData.append((messageidentifier, messagedata[1], messagedata[2], messagedata[3], messagedata[4], messagedata[5], messagedata[6], messagedata[7]))
                    if size1 == 9:
                        CanRecordData.append((messageidentifier, messagedata[1], messagedata[2], messagedata[3], messagedata[4], messagedata[5], messagedata[6], messagedata[7], messagedata[8]))

        if csvtype == 0:
            if message[0] == "Time [s]":
                csvtype = 1
            error += 1
            if error == 50:
                break
        loopcount += 1

    readtime = (timeit.default_timer() - start_time) * 1000000
    print(loopcount, "Records Processed at", readtime/loopcount, "uS per Record")
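One way the whole length-dependent if chain above could be collapsed is by splatting the tail of the split list straight into the tuple (the helper name is made up for illustration):

```python
# messagedata comes from message[5].split("0x"); element 0 is the empty
# string produced before the first "0x", so it is skipped. Unpacking the
# rest builds the same tuple as each of the nine if branches, for any
# payload length from 0 to 8 bytes.
def make_record(messageidentifier, messagedata):
    return (messageidentifier, *messagedata[1:])
```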

Upvotes: 2

Views: 948

Answers (1)

John Zwinck

Reputation: 249093

Pandas' read_csv() will give you a DataFrame:

    Time [s]  Packet  Type  Identifier Control                                     Data     CRC  ACK
0   0.210436       0  DATA  0x0CFAE621     0x8  0x02 0x50 0x00 0x00 0x04 0x01 0x00 0x29  0x19A8  NAK
1  52.936354   15810  DATA  0x18FC07F4     0x8  0xF0 0x09 0x00 0x00 0xCE 0x03 0x92 0x20  0x0C47  ACK

Then, split the data bytes if you want:

import pandas as pd
df = pd.read_csv('t.csv')
df.Data.str.split(expand=True)

Which gives you:

      0     1     2     3     4     5     6     7
0  0x02  0x50  0x00  0x00  0x04  0x01  0x00  0x29
1  0xF0  0x09  0x00  0x00  0xCE  0x03  0x92  0x20

This will be much, much faster than Python loops, and the storage will be more compact as well, especially if you parse the hex numbers into actual integers: see convert pandas dataframe column from hex string to int
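A minimal sketch of that conversion, using the Identifier column and the mask from the question (the two-row DataFrame stands in for the real `read_csv` result):

```python
import pandas as pd

# Parse the hex Identifier strings into real integers, then apply the
# question's mask to the whole integer column at once.
df = pd.DataFrame({'Identifier': ['0x0CFAE621', '0x18FC07F4']})
ids = df['Identifier'].apply(int, base=16) & 0x00FFFF00
```

`Series.apply` forwards the `base=16` keyword to `int`, and the `&` then operates on the resulting integer column element-wise.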

Upvotes: 3
