pHorseSpec

Reputation: 1274

Fix a CSV File's Format

So I'm opening a csv file to parse, but certain lines in the csv are formatted incorrectly. The csv format is typically the following throughout:

'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'

but at certain points in the csv, when an ipAddress has more than one associatedTix, it gets formatted like the following:

'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'associatedTix','associatedTix''\n'
'associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'

So what I was going to do to get the csv in the proper format was:

for line in inputCsvFile:
    chunks = line.split(",")
    if associatedTix in chunks[0]:
        # go through the following lines after that line until you find an ip address
        # go one line above the line with the ip address
        # push that column to the above row, and repeat until you get back to the original row with the ip address

The 3 commented lines are the ones I'm having trouble coming up with syntax for, so any help determining that syntax would be greatly appreciated. Also, confirmation that my logic will get the csv into the correct format would be appreciated as well.
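For reference, the kind of merging I have in mind looks roughly like the sketch below (completely untested; the IP_LINE regex and the merge_rows name are just placeholders I made up, not anything from the real file):

import re

# Sketch only: assume a "real" row starts with something that looks like an
# IPv4 address (possibly quoted and/or with a -NN suffix); continuation rows don't.
IP_LINE = re.compile(r"^['\"]?\d{1,3}(\.\d{1,3}){3}")

def merge_rows(inputCsvFile):
    rows = []
    for line in inputCsvFile:
        # Split the line and drop quotes/whitespace around each column
        chunks = [c.strip().strip("'\"") for c in line.strip().split(",")]
        chunks = [c for c in chunks if c]
        if not chunks:
            continue
        if IP_LINE.match(line.strip()):
            # Line starts with an ip address: start a new row
            rows.append(chunks)
        elif rows:
            # No ip address: push these columns up onto the previous row
            rows[-1].extend(chunks)
    return rows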

Upvotes: 0

Views: 2832

Answers (2)

Moon Cheesez

Reputation: 2701

As Ignacio has said, there isn't really a problem if you are using the csv module. If you don't want to use it, use this:

with open("inCSV.txt", "r") as f:
    # Buffer
    b = ""
    keep_reading = False
    # Stream the file line by line instead of reading it all at once
    for line in f:
        line = line.rstrip("\n")
        if "\"" in line:
            # A bunch of tixs are going to appear!
            if b == "":
                # There are more tixs to read
                b += line
                # More tixs to come
                keep_reading = True
            else:
                # This is the last tix to read
                b += line.replace(",", "")
                # Remove newlines, extra whitespace and commas
                b = b.translate(None, " ,\n\"")
                # Add nice looking whitespace
                b = b.replace("E", " E")
                b = b.replace(":", ": ")
                b = b.replace("I", " I")
                b = b.strip()
                # Add comma after IP address
                ip_index = b.find(" ")
                b = b.replace(b[:ip_index + 1], b[:ip_index] + ",")
                # No more tixs to read
                keep_reading = False

                print b
                # reset buffer
                b = ""
        elif keep_reading:
            b += line
        else:
            print line

The advantage of this is, as martineau has said, that you do not need to store the whole file in memory.

If you use the csv module however, you would have to do a bit more manipulation:

import csv
with open("inCSV.txt", "r") as f:
    text = csv.reader(f)
    for line in text:
        # Get associated tix
        tix = line[1]
        # Remove newlines, extra whitespace and commas
        tix = tix.translate(None, " ,\n")
        # Add nice looking whitespace
        tix = tix.replace("E", " E")
        tix = tix.replace(":", ": ")
        tix = tix.strip()

        line[1] = tix
        print line

Both will give you:

['248.53.88.234-24', '']
['61.15.168.199-24', '']
['181.140.27.200', '']
['192.128.254.150', '']
['8.160.137.156', 'ESCCB ID#: 90Z-007463']
['136.107.169.150', '']
['165.246.197.229', 'ESCCB ID#: 90Z-009204 ESCCB ID#: 90Z-003262 ESCCB ID#: 90Z-003011 ESCCB ID#: 90Z-001047']
['155.89.77.11', '']
['91.195.188.160', '']
['154.176.191.130', '']
['105.98.164.205', '']
['245.6.16.92', '']
['207.108.19.66', 'ESCCB ID#: 90Z-002345']
['84.71.75.211', 'ESCCB ID#: 90Z-008567 ESCCB ID#: 90Z-006765 ESCCB ID#: 90Z-009384ESCCB ID#: 90Z-001234ESCCB ID#: 90Z-007465']
['33.236.5.19', '']
['127.42.160.158', 'ESCCB ID#: 90Z-002939']
['94.34.104.184', '']

Upvotes: 1

Ignacio Vazquez-Abrams

Reputation: 798606

csv handles fields with newlines properly as long as they are quoted:

$ cat t.csv
136.107.169.150,
165.246.197.229,"ESCCB ID#: 90Z-009204,
ESCCB ID#: 90Z-003262,
ESCCB ID#: 90Z-003011                   ESCCB ID#: 90Z-001047"
155.89.77.11,
91.195.188.160,
154.176.191.130,

...

>>> import csv
>>> with open('t.csv') as fp:
...   read = csv.reader(fp)
...   for line in read:
...     print line
... 
['136.107.169.150', '']
['165.246.197.229', 'ESCCB ID#: 90Z-009204,\nESCCB ID#: 90Z-003262,\nESCCB ID#: 90Z-003011                   ESCCB ID#: 90Z-001047']
['155.89.77.11', '']
['91.195.188.160', '']
['154.176.191.130', '']

So the problem you think you have, you actually don't. All you need to do is post-process the second field and then write it back out.
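For example, a minimal post-processing pass could look something like this (Python 2 style to match the session above; the t_clean.csv name and the whitespace cleanup are just placeholders for whatever output format you actually want):

import csv

with open('t.csv') as fp, open('t_clean.csv', 'wb') as out:
    read = csv.reader(fp)
    write = csv.writer(out)
    for line in read:
        if len(line) > 1:
            # Collapse the embedded newlines and runs of spaces in the
            # second field into single spaces
            line[1] = ' '.join(line[1].split())
        write.writerow(line)

csv.writer will re-quote the field for you if it still contains a comma or newline, so the output stays valid CSV.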

Upvotes: 2
