Parse CSV file with commas in some columns Python

Question

I have a file with the below example lines:

(22642441022L, u'Click', u'fox, dog, cat are examples http://example.com')
(1153634043, u'Click', u"I learned so much from my mistakes, I think I'm gonna make some more")

I'm trying to parse it to a list of objects with this code:

import csv

file_path = 'Data/example.txt'
data = []

with open(file_path, 'r') as f:
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        data.append({'id' : row[0], 'source' : row[1], 'content' : row[2]})

As expected, the content is truncated due to the ',' in the content column. Is there any package that can help me parse this out of the box?

cs95 · Accepted Answer

Looking at your data, someone has dumped the str version of a list into a file as-is, using python2.

One thing's for sure - you can't use a CSV reader for this data. You can't even use a JSON parser (which would've been the next best thing).

What you can do, is use ast.literal_eval. With python2, this works out of the box.

import ast

data = []
with open('file.txt') as f:
    for line in f:
        try:
            data.append(ast.literal_eval(line))
        except (SyntaxError, ValueError):
            pass

data should look something like this -

[(22642441022L,
  'Click',
  'fox, dog, cat are examples http://example.com'),
 (1153634043,
  'Click',
  "I learned so much from my mistakes, I think I'm gonna make some more")]

You can then pass data into a DataFrame as-is -

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

             A                                       B  \
0  22642441022  Click   
1   1153634043  Click   

                                                   C  
0      fox, dog, cat are examples http://example.com  
1  I learned so much from my mistakes, I think I'...

If you want this to work with python3, you'll need to get rid of the long suffix L, and the unicode prefix u. You might be able to do this using re.sub from the re module.

import re

for line in f:
    try:
        i = re.sub('(\d+)L', r'\1', line)       # remove L suffix
        j = re.sub('(?<=,\s)u(?=\')', '', i)    # remove u prefix
        data.append(ast.literal_eval(j))
    except (SyntaxError, ValueError):
        pass

Notice the added re.sub('(\d+)L', r'\1', line), which removes the L suffix at the end of a string of digits.

Parse CSV file with commas in some columns Python

Answers (2)

Related Questions