Python: eliminate extra comma (Error tokenizing data. C error: Expected 3 fields in line 29, saw 4)

Question

The error is caused by 'Food, Beverage & Tobacco', which contains an extra comma that prevents pandas from reading the CSV file. It raises this error:

Error tokenizing data. C error: Expected 3 fields in line 29, saw 4

How can I elegantly eliminate the extra comma in the 'GICS industry group' column of the CSV file (specifically the comma that comes right after 'Food')?

Here is my code:

#!/usr/bin/env python2.7
print "hello from python 2"

import pandas as pd
from lxml import html
import requests
import urllib2
import os


url = 'http://www.asx.com.au/asx/research/ASXListedCompanies.csv'

response = urllib2.urlopen(url)
html = response.read()
#html = html.replace('"','')

with open('asxtest.csv', 'wb') as f:
    f.write(html)

with open("asxtest.csv",'r') as f:
    with open("asx.csv",'w') as f1:
        f.next()#skip header line
        f.next()#skip 2nd line
        for line in f:
             if line.count(',')>2:
                 line[2] = 'Food Beverage & Tobacco'
             f1.write(line)

os.remove('asxtest.csv')

df_api = pd.read_csv('asx.csv')
df_api.rename(columns={'Company name': 'Company', 'ASX code': 'Stock','GICS industry group': 'Industry'}, inplace=True)

James · Accepted Answer

The file from the URL in your post contains additional commas for some items in the GICS industry group column. The first occurs at line 31 in the file:

ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco

Normally, the 3rd item should be surrounded by quotes to escape breaking on the comma, such as:

ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"
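To see why the quotes matter, both versions of the line can be run through Python's standard csv module. This is only an illustrative sketch (the helper name parse_line is made up here); pandas uses its own parser, but it follows the same double-quote convention by default.

```python
import csv
import io

unquoted = 'ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n'
quoted = 'ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"\n'

def parse_line(text):
    """Parse a single CSV line with the standard csv module."""
    return next(csv.reader(io.StringIO(text)))

unquoted_fields = parse_line(unquoted)
quoted_fields = parse_line(quoted)

print(len(unquoted_fields))  # 4 fields: the comma after "Food" splits the value
print(len(quoted_fields))    # 3 fields: the quoted value stays intact
```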

For this situation, because the first 2 columns appear to be clean, you can merge any additional text into the 3rd field. After this cleaning, load it into a data frame.

You can do this with a generator that will pull out and clean each line one at a time. The pd.DataFrame constructor will read in the data and create a data frame.

import pandas as pd

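def merge_last(file_name, skip_lines=0):
    """Yield (col1, col2, rest), merging any extra commas into the 3rd field."""
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            # skip the leading non-data lines
            if i < skip_lines:
                continue
            # star-unpacking requires Python 3
            x, y, *z = line.strip().split(',')
            yield (x, y, ','.join(z))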

# create a generator to clean the lines, skipping the first 2
gen = merge_last('ASXListedCompanies.csv', 2)
# get the column names
header = next(gen)
# create the data frame
df = pd.DataFrame(gen, columns=header)

df.head()

returns:

          Company name ASX code                 GICS industry group
0          MOQ LIMITED      MOQ                 Software & Services
1       1-PAGE LIMITED      1PG                 Software & Services
2  1300 SMILES LIMITED      ONT    Health Care Equipment & Services
3    1ST GROUP LIMITED      1ST    Health Care Equipment & Services
4         333D LIMITED      T3D  Commercial & Professional Services

And the rows with the extra commas are preserved:

df.loc[27:30]
# returns:
                           Company name ASX code       GICS industry group
27             ABUNDANT PRODUCE LIMITED      ABT  Food, Beverage & Tobacco
28                  ACACIA COAL LIMITED      AJC                    Energy
29  ACADEMIES AUSTRALASIA GROUP LIMITED      AKG         Consumer Services
30         ACCELERATE RESOURCES LIMITED      AX8                Class Pend
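As an alternative to star-unpacking, str.split accepts a maximum number of splits, which gives the same merge and should also work on Python 2.7 (the version used in the question). A minimal sketch; the helper name split_three is made up for illustration.

```python
def split_three(line):
    """Split on the first two commas only; everything after stays in field 3."""
    return line.rstrip('\r\n').split(',', 2)

row = split_three('ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n')
print(row)  # ['ABUNDANT PRODUCE LIMITED', 'ABT', 'Food, Beverage & Tobacco']
```

Lines that already have exactly three fields pass through unchanged, so the same helper can be applied to every row.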

Here is a more generalized generator that will merge after a given number of columns:

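def merge_last(file_name, merge_after_col=2, skip_lines=0):
    """Yield tuples whose last element merges every field after merge_after_col."""
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            # skip the leading non-data lines
            if i < skip_lines:
                continue
            spl = line.strip().split(',')
            # unpacking inside a tuple display requires Python 3.5+
            yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:]))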
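If you would rather keep the approach from the question (write a cleaned file, then call pd.read_csv on it), the same merge can be combined with csv.writer, which quotes any field that contains a comma. The sketch below is an assumption-laden illustration, not part of the original answer: the function name requote and the sample text are made up, and in-memory buffers stand in for real files.

```python
import csv
import io

def requote(src, dst, merge_after_col=2, skip_lines=0):
    """Copy CSV lines from src to dst, merging overflow fields into the last column.

    csv.writer quotes fields containing a comma, so the output is valid CSV.
    """
    writer = csv.writer(dst, lineterminator='\n')
    for i, line in enumerate(src):
        # skip the leading non-data lines and any blank lines
        if i < skip_lines or not line.strip():
            continue
        writer.writerow(line.strip().split(',', merge_after_col))

# Made-up sample imitating the layout of the downloaded file
sample = io.StringIO(
    'Title line to skip\n'
    '\n'
    'Company name,ASX code,GICS industry group\n'
    'MOQ LIMITED,MOQ,Software & Services\n'
    'ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n'
)
cleaned = io.StringIO()
requote(sample, cleaned, skip_lines=2)
result = cleaned.getvalue()
print(result)
```

With real files, src and dst would be open file objects, and the cleaned output could then be passed to pd.read_csv without further handling.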
