thecrazyones

Reputation: 11

Trying to read csv file in python and creating separate table

import numpy as np
import pandas as pd

I'm trying to read a CSV file using pandas. This is the data that I scraped. Note that each line starts and ends with square brackets [] (maybe it's a list). What should I write so that the entire data set ends up in table form? I don't know how to separate the brackets from the data.

[]
['Auburn University (Online Master of Business Administration with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' /Campus ', ' Raymond J. Harbert College of Business ']
['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']
['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']
['The University of Alabama (MS in Operations Management - Decision Analytics Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (M.S. degree in Applied Statistics, Data Mining Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (MBA with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Culverhouse College of Commerce ']
['Arkansas Tech University (Business Data Analytics)', ' Bachelors ', ' US', ' AR', ' /Campus ', ' Business ']
['University of Arkansas (Graduate Certificate in Business Analytics)', ' Certificate ', ' US', ' AR', ' Online/ ', ' Sam M. Walton College of Business ']
['University of Arkansas (Master of Information Systems with Business Analytics Concentration)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of Business ']
['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of 

How should I read this CSV file? I want all the data in table form. Please help.

Upvotes: 1

Views: 127

Answers (2)

haofeng

Reputation: 651

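The basic way to read file.csv line by line:

def process(string):
    # Placeholder for whatever per-line handling you need
    print("Processing:", string)

data = []
with open("file.csv") as f:
    for line in f:
        line = line.rstrip("\n")  # drop the trailing newline
        process(line)
        data.append(line)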

Upvotes: -1

CryptoFool

Reputation: 23119

Your problem is exactly what the error message is telling you it is. The error is in parsing this line:

['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']

The code ignores quote characters and breaks the line up into fields, making a break wherever it finds the delimiter ", ". You're expecting this to be a single field:

The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)

but this "field" has an instance of the delimiter ", " in it, which the CSV parser will honor because it is ignoring the fact that you have this value in quotes. So this piece of data is broken into two fields:

['The University of Alabama (Master of Science in Marketing

and

Specialization in Marketing Analytics)'

This results in the line being broken into 7 fields, and your code is expecting only 6.
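The split can be reproduced with the standard library. This is a minimal sketch that uses Python's csv module with its default settings as a stand-in for the parser in the question; the default dialect only treats double quotes as quote characters, so the single quotes here give no protection against the embedded comma.

```python
import csv

line = ("['The University of Alabama (Master of Science in Marketing, "
        "Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', "
        "' Online/ ', ' Manderson Graduate School of Business ']")

# With default settings, only double quotes count as quote characters,
# so this single-quoted line is split at every comma.
fields = next(csv.reader([line]))

print(len(fields))
for field in fields[:2]:
    print(repr(field))
```

The first two printed fields are the two halves of what was meant to be one value.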

Note that in addition, your items are going to include the quotes, which may not be what you're expecting either, and those square brackets don't belong there. In short, this isn't a well-formed CSV file.
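Since each line happens to look like a Python list literal, one possible alternative is to parse it with ast.literal_eval from the standard library. This is only a sketch under that assumption, and it is a different technique from the regex approach; parse_lines is a hypothetical helper name. Lines that are not complete literals, such as the truncated last line of the data, are skipped.

```python
import ast

def parse_lines(lines):
    """Parse lines that look like Python list literals into lists of stripped strings."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # not a complete literal (e.g. a truncated line), skip it
        if isinstance(row, list) and row:  # also drops the empty "[]" line
            rows.append([str(item).strip() for item in row])
    return rows

sample = [
    "[]",
    "['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']",
    "['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of ",
]

rows = parse_lines(sample)
print(rows)
```

To run it on a file, pass the open file object instead of the sample list, since iterating over a file yields its lines.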

UPDATE: I'm a regex weenie. I do everything with regular expressions, and can't ignore a challenge like this. Here's a regex-based solution that will read exactly what you want out of this data. If you want it to recognize the last line of your data, you should add "']" to the end of that line.

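import regex  # third-party module (pip install regex), not the built-in re
from pprint import pprint

def parse_file(file):
    # Matches one bracketed line: group 2 is the first quoted field,
    # group 4 matches each of the quoted fields that follow it.
    linepat = regex.compile(r"\[\s*('([^']*)')?(\s*,\s*'([^']*)')*\s*\]")
    with open(file) as f:
        r = []
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            m = linepat.match(line)
            # captures(4) returns every match of the repeated group 4
            if m and m.captures(4):
                fields = [m.group(2)] + [s.strip() for s in m.captures(4)]
                r.append(fields)
    return r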

def main():
    r = parse_file("/tmp/blah.csv")
    pprint(r)

main()

Result:

[['Auburn University (Online Master of Business Administration with '
  'concentration in Business Analytics)',
  'Masters',
  'US',
  'AL',
  '/Campus',
  'Raymond J. Harbert College of Business'],
 ...
 ['University of Arkansas (Professional Master of Information Systems)',
  'Masters',
  'US',
  'AR',
  '/Campus',
  'Sam M. Walton College of']]

Note that this doesn't use the built-in 're' module. That module keeps only the last match of a repeated group, while 'regex' exposes every match through captures(), which is a must for this kind of problem. Also note that this doesn't involve Pandas. I don't know anything about that module, but I assume it is trivial to feed the clean, parsed data from this code into Pandas if that's where you really want it.
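For what it's worth, that last step might look roughly like the sketch below. It assumes the rows have the shape returned by parse_file() above. The column names are hypothetical, since the scraped data has no header row, so rename them to whatever fits.

```python
import pandas as pd

# Rows in the shape returned by parse_file(): one list of six strings per line
rows = [
    ["Auburn University (Data Science)", "Bachelors", "US", "AL", "/Campus", "Business"],
    ["Arkansas Tech University (Business Data Analytics)", "Bachelors", "US", "AR", "/Campus", "Business"],
]

# Hypothetical column names; the source data does not define any
columns = ["program", "degree", "country", "state", "delivery", "school"]

df = pd.DataFrame(rows, columns=columns)
print(df.shape)
```

Every row must have exactly as many items as there are column names, otherwise pandas raises an error when building the frame.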

Upvotes: 2
