Reputation: 540
I have a CSV file with 13 million lines. The fields are not quoted, and some of them contain embedded newlines, so a single row of data can be split across two lines. No row has more than one such break.
How would I take data like this?
Line of data
Line of data
continuation of previous line of data
Line of data
Line of data
continuation of previous line
Line of data
And turn it into this:
Line of data
Line of data continuation of previous line of data
Line of data
Line of data continuation of previous line
Line of data
I've tried storing each line in a variable and then checking whether the next line starts with anything other than 'L', appending it if so. I've also tried using f.tell() and f.seek() to move around in the file, but I haven't been able to get it to work.
Upvotes: 2
Views: 101
Reputation: 550
Assuming every time a line starts with a space it should be concatenated with the preceding line, this should work:
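with open(data) as infile:  # data is the path to the CSV file
    previous_line = None
    for line in infile:
        if previous_line is not None and line.startswith(' '):
            # Continuation: glue it onto the buffered line instead of printing.
            previous_line = previous_line.strip() + line
            continue
        if previous_line is not None:
            print(previous_line.strip())
        previous_line = line
    # Flush the last buffered line once the file is exhausted.
    if previous_line is not None:
        print(previous_line.strip())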
Upvotes: 3
Reputation: 40904
Here's a cheap, reasonably efficient continuation line joiner for you.
def cont_lines(source):
    last_line = ''
    for line in source:
        if line.startswith(' '):
            # Append a continuation: drop the buffered newline first,
            # otherwise the two halves would still sit on separate lines.
            last_line = last_line.rstrip('\n') + ' ' + line.lstrip()
        else:
            if last_line:
                yield last_line
            last_line = line
    if last_line:  # The one remaining as the source has ended.
        yield last_line
Use like this:
with open("tile.csv") as f:
    for line in cont_lines(f):
        pass  # do something with line
It only uses as much memory as the longest set of continuation lines in your file.
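For a file this size you would probably stream the joined rows straight into a new file rather than collect them. Here is a minimal sketch of that round trip, assuming continuation lines begin with a space as in this answer. `join_file` and its path arguments are hypothetical names, and the generator is repeated (joining on a single space) only so the block runs on its own:

```python
def cont_lines(source):
    """Yield logical rows, folding space-prefixed lines into the previous one."""
    last_line = ''
    for line in source:
        if line.startswith(' ') and last_line:
            # Drop the buffered newline, then append the continuation text.
            last_line = last_line.rstrip('\n') + ' ' + line.lstrip()
        else:
            if last_line:
                yield last_line
            last_line = line
    if last_line:
        yield last_line


def join_file(src_path, dst_path):
    """Hypothetical helper: copy src to dst with continuations joined."""
    rows = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for row in cont_lines(src):
            dst.write(row)  # each row still carries its trailing newline
            rows += 1
    return rows
```

Only one logical row is held in memory at a time, so the 13 million lines are never loaded together.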
Upvotes: 2
Reputation: 540
I was able to work out something.
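infile = "test.txt"

def peek_line(f):
    # Read the next line, then rewind so the caller sees it again.
    pos = f.tell()
    line = f.readline()
    f.seek(pos)
    return line

with open(infile, 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        peek = peek_line(f)
        # An empty peek means end of file, so there is nothing to join.
        if peek and not peek.startswith('T'):
            line = line.strip() + f.readline()
        print(line, end='')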
I'm open to feedback on this method.
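One piece of feedback on the peek approach: tell() and seek() cause every line to be read twice, and at end of file the empty peek also passes the "not startswith" check, which strips the newline from the last line. Below is a hedged single-pass sketch that keeps the same rule (a new row starts with a known prefix, 'T' in the test file above) without seeking. `join_by_prefix` is a hypothetical name, and joining the two halves with a single space is an assumption about the desired output:

```python
def join_by_prefix(lines, prefix):
    """Yield rows; a line not starting with `prefix` continues the previous row."""
    buffered = None
    for line in lines:
        if buffered is not None and not line.startswith(prefix):
            # Continuation: fold it into the buffered row.
            buffered = buffered.rstrip('\n') + ' ' + line.lstrip()
        else:
            if buffered is not None:
                yield buffered
            buffered = line
    # Emit whatever is still buffered when the input ends.
    if buffered is not None:
        yield buffered
```

It can be called as join_by_prefix(f, 'T') on the open file, printing each row with end='' because the rows keep their trailing newline.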
Upvotes: 0