Bill Swearingen

Reputation: 664

Cleaning up CSV -- starting a new line

Sorry for the stupid question. I am not sure if I am just tired, but I am having a hard time figuring out the logic for solving this problem.

I have a csv that looks like this:

Company,CompanyName,
Website,WebsiteName ,
Website, WebsiteName2,
Email, emailData,
Company,NextCompanyName,
Website,websiteName,
Website, WebsiteName2,
Company,NextCompanyName,
Name,PersonName,
Website,websiteName,

As you can see, it is pretty nasty data. What I would like to do is read in the entire CSV, group the lines by company name, and organize as much data as possible. Sometimes a company has a person's name, sometimes multiple websites, sometimes an email, and sometimes none of these.

So my desired output would be: Company Name, Person's Name, Email Address, Web1, Web2, etc

The good news is that every row starts with a label (Company, Website, Name, etc.). What I want to do is read through the CSV and, when it finds a row that looks like Company,CompanyName, start a new output row and sort the data into columns (Name to the Name column, email to the Email column, etc.) until it runs into another row that looks like Company,CompanyName.

I don't need help reading or writing the CSV. I am looking for help on how to properly iterate over the data and sort it to where it needs to be.

Thanks for any suggestions you can give me

Upvotes: 1

Views: 65

Answers (2)

tdelaney

Reputation: 77337

You can check for a record start condition as you iterate the lines of the file. Record each key/value pair in a dict and when you see the start, you know the existing record is complete. You can make the values in your record dict a list and append new values as you find them. Remember to store the final record after the loop ends, since no further Company row follows it.

from collections import defaultdict
import csv

filename = 'mytest.csv'

# test data
with open(filename, 'w') as out_fp:
    out_fp.write("""Company,CompanyName,
Website,WebsiteName ,
Website, WebsiteName2,
Email, emailData,
Company,NextCompanyName,
Website,websiteName,
Website, WebsiteName2,
Company,NextCompanyName,
Name,PersonName,
Website,websiteName,""")

# will hold dict for each company
records = []

with open(filename, newline='') as in_fp:
    record = defaultdict(list)
    for row in csv.reader(in_fp):
        if len(row) >= 2:
            if row[0].strip() == "Company" and "Company" in record:
                # found new company... store the completed record
                records.append(record)
                record = defaultdict(list)
            record[row[0].strip()].append(row[1].strip())
    # the last record has no following Company row, so store it here
    if record:
        records.append(record)

for record in records:
    print('----')
    print(record)
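
To get from a list of per-company dicts to the column layout asked for in the question (Company, Name, Email, Web1, Web2, ...), something like the following might work. This is only a sketch: the `flatten` helper, the `Web1`/`Web2` header names, and the inline sample data are assumptions made for illustration, and multiple names or emails are simply joined with "; ".

```python
import csv
import io

# Hypothetical sample shaped like the per-company dicts described above.
records = [
    {"Company": ["CompanyName"],
     "Website": ["WebsiteName", "WebsiteName2"],
     "Email": ["emailData"]},
    {"Company": ["NextCompanyName"],
     "Name": ["PersonName"],
     "Website": ["websiteName"]},
]


def flatten(records):
    """Return a header row plus one row per company, padded to equal width."""
    # The widest record decides how many website columns are needed.
    max_sites = max((len(r.get("Website", [])) for r in records), default=0)
    header = ["Company", "Name", "Email"]
    header += [f"Web{i + 1}" for i in range(max_sites)]
    rows = [header]
    for r in records:
        sites = list(r.get("Website", []))
        sites += [""] * (max_sites - len(sites))  # pad missing websites
        rows.append([
            "; ".join(r.get("Company", [])),
            "; ".join(r.get("Name", [])),
            "; ".join(r.get("Email", [])),
            *sites,
        ])
    return rows


rows = flatten(records)

# Write to an in-memory buffer here; a real file opened with newline=''
# could be passed to csv.writer instead.
out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

Missing fields come out as empty cells, so every row has the same number of columns.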

Upvotes: 1

Dogeek

Reputation: 302

You could use a simple condition and sort everything into lists, or even into a single dictionary (which I think is a little more complicated, but not by much).

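companyList = []
with open("foo.csv", "r") as f:
    for line in f:
        fields = line.split(',')
        # match only rows whose first field is exactly "Company"
        if len(fields) >= 2 and fields[0].strip() == "Company":
            companyList.append(fields[1].strip())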

Do the same with a list for each kind of row, then rebuild your CSV the way you want it and write it out.
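
The single-dictionary variant could look roughly like this. It is a sketch under assumptions: the `group_by_company` name and the inline sample are made up for illustration, and because the dictionary is keyed by company name, repeated names (such as NextCompanyName in the question's data) are merged into one entry, which may or may not be what you want.

```python
import csv
import io

# Sample data copied from the question, held in memory for the example.
sample = """Company,CompanyName,
Website,WebsiteName ,
Website, WebsiteName2,
Email, emailData,
Company,NextCompanyName,
Website,websiteName,
Website, WebsiteName2,
Company,NextCompanyName,
Name,PersonName,
Website,websiteName,
"""


def group_by_company(lines):
    """Map each company name to a dict of {label: [values]}."""
    companies = {}
    current = None  # dict of the company currently being filled
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        key, value = row[0].strip(), row[1].strip()
        if key == "Company":
            # Start (or reopen) the entry for this company name.
            current = companies.setdefault(value, {})
        elif current is not None:
            current.setdefault(key, []).append(value)
    return companies


companies = group_by_company(io.StringIO(sample))
for name, fields in companies.items():
    print(name, fields)
```

If duplicate company names must stay separate, a list of records is the safer structure.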

Upvotes: 0
