Mokhtar Ashour
Mokhtar Ashour

Reputation: 600

Parse CSV file with commas in some columns Python

I have a file with the below example lines:

(22642441022L, u'<a href="http://example.com">Click</a>', u'fox, dog, cat are examples http://example.com')
(1153634043, u'<a href="http://example.com">Click</a>', u"I learned so much from my mistakes, I think I'm gonna make some more")

I'm trying to parse it to a list of objects with this code:

import csv

file_path = 'Data/example.txt'
data = []

with open(file_path, 'r') as f:
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        data.append({'id' : row[0], 'source' : row[1], 'content' : row[2]})

As expected, the content is truncated due to the ',' in the content column. Is there any package that can help me parse this out of the box?

Upvotes: 1

Views: 725

Answers (2)

cs95
cs95

Reputation: 402333

Looking at your data, someone has dumped the str version of a list into a file as-is, using python2.

One thing's for sure - you can't use a CSV reader for this data. You can't even use a JSON parser (which would've been the next best thing).

What you can do, is use ast.literal_eval. With python2, this works out of the box.

import ast

data = []
with open('file.txt') as f:
    for line in f:
        try:
            data.append(ast.literal_eval(line))
        except (SyntaxError, ValueError):
            pass

data should look something like this -

[(22642441022L,
  '<a href="http://example.com">Click</a>',
  'fox, dog, cat are examples http://example.com'),
 (1153634043,
  '<a href="http://example.com">Click</a>',
  "I learned so much from my mistakes, I think I'm gonna make some more")]

You can then pass data into a DataFrame as-is -

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

             A                                       B  \
0  22642441022  <a href="http://example.com">Click</a>   
1   1153634043  <a href="http://example.com">Click</a>   

                                                   C  
0      fox, dog, cat are examples http://example.com  
1  I learned so much from my mistakes, I think I'...  

If you want this to work with python3, you'll need to get rid of the long suffix L, and the unicode prefix u. You might be able to do this using re.sub from the re module.

import re
for line in f:
    try:
        i = re.sub('(\d+)L', r'\1', line)       # remove L suffix
        j = re.sub('(?<=,\s)u(?=\')', '', i)    # remove u prefix
        data.append(ast.literal_eval(j))
    except (SyntaxError, ValueError):
        pass

Notice the added re.sub('(\d+)L', r'\1', line), which removes the L suffix at the end of a string of digits.

Upvotes: 2

Savir
Savir

Reputation: 18418

So it looks like the file was generated doing something like this (a pure dump of a Python str() or print):

data_list = [
    (22642441022L, u'<a href="http://example.com">Click</a>', u'fox, dog, cat are examples http://example.com'),
    (1153634043, u'<a href="http://example.com">Click</a>', u"I learned so much from my mistakes, I think I'm gonna make some more")
]  # List of tuples

with open('./stack_084.txt', 'w') as f:
    f.write('\n'.join([str(data) for data in data_list]))

Regular expressions come to mind (assuming that the values on your second "column") always start with <a and end with a>:

import pprint
import re

line_re = re.compile(
    r'\('
    r'(?P<num>\d+)L{0,1}.'
    r'+?'
    r'[\'\"](?P<source>\<a.+?a\>)[\"\']'
    r'.+?'
    r'[\'\"](?P<content>.+?)[\"\']'
    r'\)'
)

data = []
with open('./stack_084.txt', 'r') as f:
    for line in f:
        match = line_re.match(line)
        if match:
            data.append({
                'id': int(match.groupdict()['num']),
                'source': match.groupdict()['source'],
                'content': match.groupdict()['content']
            })

# You should see parsed data here:
print(pprint.pformat(data))

This outputs:

[{'content': 'fox, dog, cat are examples http://example.com',
  'id': 22642441022,
  'source': '<a href="http://example.com">Click</a>'},
 {'content': "I learned so much from my mistakes, I think I'm gonna make some "
             'more',
  'id': 1153634043,
  'source': '<a href="http://example.com">Click</a>'}]

Upvotes: 1

Related Questions