user10332687

CSV reader in Python 3 with multi-character separators

Is there an alternative to the csv module for reading a CSV file in Python 3 in a streaming way? Currently my data looks something like this:

"field1"::"field2"::"field3"\x02\n
"1"::"hi\n"::"3"\x02\n
"8"::"ok"::"3"\x02\n

The field separator is two characters, :: (the csv module only accepts a single-character delimiter), and the line terminator is also two characters, \x02\n. Is there a CSV reader for Python that supports this and can work in a streaming mode?

Here is an example of what I'm trying to do:

>>> import csv
>>> s = '''"field1"::"field2"::"field3"\x02\n\n"1"::"hi\n"::"3"\x02\n\n"8"::"ok"::"3"\x02\n'''
>>> csvreader=csv.reader(s, delimiter='::', lineterminator='\x02\n')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: "delimiter" must be a 1-character string

Loading pandas just to read this csv seems like overkill x 100, so I'd like to see what other options there are.

Upvotes: 2

Views: 898

Answers (2)

Martin Evans

Reputation: 46779

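As you have discovered, the csv module cannot read that format directly. You could, though, pre-process each line before handing it to csv.reader: strip the \x02 marker and collapse the :: separator to a single :, which csv.reader accepts as a delimiter. For example, the following approach should work: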

from io import StringIO
import csv

s = '''"field1"::"field2"::"field3"\x02\n\n"1"::"hi\n"::"3"\x02\n\n"8"::"ok"::"3"\x02\n'''

def csv_reader_alt(source):
    # Drop the \x02 marker and collapse '::' to ':' lazily, one line at a time
    cleaned = (line.replace('\x02', '').replace('::', ':') for line in source)
    return csv.reader(cleaned, delimiter=':')

for row in csv_reader_alt(StringIO(s)):
    if row:
        print(row)

This gives the following output:

['field1', 'field2', 'field3']
['1', 'hi\n', '3']
['8', 'ok', '3']
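
One caveat with the replace-based pre-processing is worth checking against your own data. A single : inside a quoted field is fine, but a :: inside a quoted field is rewritten before csv.reader sees it. A small sketch to illustrate (the sample rows here are invented for the demonstration and are not from the question):

```python
import csv
from io import StringIO

def csv_reader_alt(source):
    # Same pre-processing as in the answer: drop \x02, collapse '::' to ':'
    cleaned = (line.replace('\x02', '').replace('::', ':') for line in source)
    return csv.reader(cleaned, delimiter=':')

# A single ':' inside a quoted field survives, because csv.reader honours the quotes
single = list(csv_reader_alt(StringIO('"a:b"::"c"\x02\n')))

# A '::' inside a quoted field is collapsed to ':' by the replace step
double = list(csv_reader_alt(StringIO('"a::b"::"c"\x02\n')))

print(single)  # [['a:b', 'c']]
print(double)  # [['a:b', 'c']]
```

Both inputs produce the same row, so if your fields can contain :: the original value cannot be recovered with this approach.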

Upvotes: 1

Ralf

Reputation: 16515

@MartinEvans shows a nice way of doing it in his answer.

Here is code for reading from a file (rather than from a string in memory) with proper file handling. It uses a custom generator that reads the file in chunks and splits on the multi-character line terminator:

def get_line(file, delimiter='\n', bufsize=4096):
    # https://stackoverflow.com/a/19600562/9225671
    buf = ''
    while True:
        chunk = file.read(bufsize)
        if len(chunk) == 0:
            # end of file has been reached; serve the remaining data and exit
            yield buf
            return

        buf += chunk
        line_list = buf.split(delimiter)

        # don't serve the last part yet, first we need to read more chunks from the file
        buf = line_list.pop(-1)

        for line in line_list:
            yield line

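if __name__ == '__main__':
    # newline='' disables newline translation, so the data is read exactly as stored
    with open('my_file.csv', newline='') as f:
        for line in get_line(f, delimiter='\x02\n'):
            if len(line) > 0:
                # note: a plain split also splits on any '::' inside a quoted field
                parts = line.split('::')
                print(parts)
                print([e.strip('"') for e in parts])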
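
If a quoted field can itself contain ::, a plain split is not enough. Below is a rough sketch of a quote-aware variant. It assumes that every field is wrapped in double quotes, that a literal quote inside a field is written as "" (borrowed from common CSV dialects, the question does not say), and that \x02\n never occurs inside a field. The names FIELD_RE, iter_records, parse_record and read_rows are made up for this illustration.

```python
import io
import re

# One double-quoted field; a doubled quote ("") inside counts as an escaped quote
FIELD_RE = re.compile(r'"((?:[^"]|"")*)"')

def iter_records(file, terminator='\x02\n', bufsize=4096):
    """Yield raw records from a text file object, reading bufsize characters at a time."""
    buf = ''
    while True:
        chunk = file.read(bufsize)
        if not chunk:
            # end of file: serve whatever is left, if anything
            if buf:
                yield buf
            return
        buf += chunk
        # the last piece may be incomplete, so keep it in the buffer
        *complete, buf = buf.split(terminator)
        yield from complete

def parse_record(record, separator='::'):
    """Split one record into field values, ignoring separators inside quotes."""
    fields = []
    pos = 0
    while True:
        match = FIELD_RE.match(record, pos)
        if match is None:
            raise ValueError(f'expected a quoted field at offset {pos}: {record!r}')
        fields.append(match.group(1).replace('""', '"'))
        pos = match.end()
        if pos == len(record):
            return fields
        if not record.startswith(separator, pos):
            raise ValueError(f'expected {separator!r} at offset {pos}: {record!r}')
        pos += len(separator)

def read_rows(file, bufsize=4096):
    """Yield parsed rows, tolerating blank lines between records."""
    for record in iter_records(file, bufsize=bufsize):
        record = record.strip('\n')
        if record:
            yield parse_record(record)

# Invented sample data; a tiny bufsize forces terminators to straddle chunk boundaries
sample = '"field1"::"field2"::"field3"\x02\n"1"::"hi\n"::"3"\x02\n"8"::"a::b"::"3"\x02\n'
for row in read_rows(io.StringIO(sample), bufsize=5):
    print(row)
```

For a real file you would pass the object returned by open('my_file.csv', newline='') instead of the StringIO. The regex only looks at one record at a time, so memory use stays bounded by the chunk size plus the longest record.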

Does that work for you?

Upvotes: 0
