Reputation: 1274
So I'm opening a csv file to parse, but certain lines in the csv are formatted incorrectly. The csv format is typically the following throughout:
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
but at certain points in the csv (since there are more than one associateTix
with an ipaddress
) when there are more than 1 associatedTix
, it gets formatted like the following:
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
'associatedTix','associatedTix''\n'
'associatedTix''\n'
'ipAddress','associatedTix''\n'
'ipAddress','associatedTix''\n'
So what I was going to do to get the csv in the proper format was:
for line in inputCsvFile:
chunks = line.split(",")
if associatedTix in chunks[0]:
#go through the following line's after that line until you find an ip address
#go one line above the line with the ip address
#push that column to the above row, and repeat until you get to the original line3 row with the ip address
The 3 commented lines are the one I'm having trouble coming up with syntax for, so any help determining that syntax would be greatly appreciated. Also, confirmation that my logic will get the csv into the correct format would be appreciated as well.
Upvotes: 0
Views: 2832
Reputation: 2701
As Ignacio has said, there isn't really a problem if you are using the csv
module. If you don't want to use it, use this:
with open("inCSV.txt", "r") as f:
text = f.read()
# Buffer
b = ""
keep_reading = False
for line in text.split("\n"):
if "\"" in line:
# A bunch of tixs are going to appear!
if b == "":
# There are more tixs to read
b += line
# More tixs to come
keep_reading = True
else:
# This is the last tix to read
b += line.replace(",", "")
# Remove newlines, extra whitespace and commas
b = b.translate(None, " ,\n\"")
# Add nice looking whitespace
b = b.replace("E", " E")
b = b.replace(":", ": ")
b = b.replace("I", " I")
b = b.strip()
# Add comma after IP address
ip_index = b.find(" ")
b = b.replace(b[:ip_index + 1], b[:ip_index] + ",")
# No more tixs to read
keep_reading = False
print b
# reset buffer
b = ""
elif keep_reading:
b += line
else:
print line
The advantage to this is, as martineau has said, you do not need to store the whole file in memory.
If you use the csv
module however, you would have to do a bit more manipulation:
import csv
with open("inCSV.txt", "r") as f:
text = csv.reader(f)
for line in text:
# Get associated tix
tix = line[1]
# Remove newlines, extra whitespace and commas
tix = tix.translate(None, " ,\n")
# Add nice looking whitespace
tix = tix.replace("E", " E")
tix = tix.replace(":", ": ")
tix = tix.strip()
line[1] = tix
print line
both will give you:
['248.53.88.234-24', '']
['61.15.168.199-24', '']
['181.140.27.200', '']
['192.128.254.150', '']
['8.160.137.156', 'ESCCB ID#: 90Z-007463']
['136.107.169.150', '']
['165.246.197.229', 'ESCCB ID#: 90Z-009204 ESCCB ID#: 90Z-003262 ESCCB ID#: 90Z-003011 ESCCB ID#: 90Z-001047']
['155.89.77.11', '']
['91.195.188.160', '']
['154.176.191.130', '']
['105.98.164.205', '']
['245.6.16.92', '']
['207.108.19.66', 'ESCCB ID#: 90Z-002345']
['84.71.75.211', 'ESCCB ID#: 90Z-008567 ESCCB ID#: 90Z-006765 ESCCB ID#: 90Z-009384ESCCB ID#: 90Z-001234ESCCB ID#: 90Z-007465']
['33.236.5.19', '']
['127.42.160.158', 'ESCCB ID#: 90Z-002939']
['94.34.104.184', '']
Upvotes: 1
Reputation: 798606
csv
handles fields with newlines properly as long as they are quoted:
$ cat t.csv
136.107.169.150,
165.246.197.229,"ESCCB ID#: 90Z-009204,
ESCCB ID#: 90Z-003262,
ESCCB ID#: 90Z-003011 ESCCB ID#: 90Z-001047"
155.89.77.11,
91.195.188.160,
154.176.191.130,
...
>>> with open('t.csv') as fp:
... read = csv.reader(fp)
... for line in read:
... print line
...
['136.107.169.150', '']
['165.246.197.229', 'ESCCB ID#: 90Z-009204,\nESCCB ID#: 90Z-003262,\nESCCB ID#: 90Z-003011 ESCCB ID#: 90Z-001047']
['155.89.77.11', '']
['91.195.188.160', '']
['154.176.191.130', '']
So the problem you think you have, you actually don't. All you need to do is post-process the second field and then write it back out.
Upvotes: 2