Reputation: 3402
I need to read lines from a text file where the 'end of line' character is not always \n or \r (or a combination of them) and may be any sequence of characters, like 'xyz' or '|'. However, the 'end of line' is always the same and is known for each type of file.
As the text file may be big and I have to keep performance and memory usage in mind, what seems to be the best solution? Today I use a combination of file.read(1000) and split(myendofline) or partition(myendofline), but I would like to know whether a more elegant and standard solution exists.
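For illustration, a minimal sketch of the read-and-partition approach described above (the function name custom_lines, the '|' delimiter and the 1000-byte chunk size are just placeholders):
# read fixed-size chunks and cut them on the custom end-of-line string
def custom_lines(path, delimiter='|', chunksize=1000):
    buf = ''
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunksize)
            if not data:
                break
            buf += data
            while delimiter in buf:
                line, _, buf = buf.partition(delimiter)
                yield line
    yield buf   # whatever remains after the last delimiter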
Upvotes: 1
Views: 751
Reputation: 13416
Obviously the simplest approach would be to just read the whole thing and then call .split('|').
However, if that's undesirable because it requires reading the whole file into memory, you might read arbitrary chunks and perform the split on them. You could write a class that grabs another arbitrary chunk when the current one runs out, so the rest of your application doesn't need to know about it.
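For the whole-file variant, using the zen.txt file shown below, that would be something like this (a minimal sketch):
# read everything into memory, then split on the custom delimiter
with open('zen.txt', 'rb') as f:
    lines = f.read().split('|')
for line in lines:
    print line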
Here's the input, zen.txt
The Zen of Python, by Tim Peters||Beautiful is better than ugly.|Explicit is better than implicit.|Simple is better than complex.|Complex is better than complicated.|Flat is better than nested.|Sparse is better than dense.|Readability counts.|Special cases aren't special enough to break the rules.|Although practicality beats purity.|Errors should never pass silently.|Unless explicitly silenced.|In the face of ambiguity, refuse the temptation to guess.|There should be one-- and preferably only one --obvious way to do it.|Although that way may not be obvious at first unless you're Dutch.|Now is better than never.|Although never is often better than *right* now.|If the implementation is hard to explain, it's a bad idea.|If the implementation is easy to explain, it may be a good idea.|Namespaces are one honking great idea -- let's do more of those!
Here's my little test case, which works for me. It doesn't handle a whole bunch of corner cases, nor is it particularly pretty, but it should get you started.
class SpecialDelimiters(object):
    def __init__(self, filehandle, terminator, chunksize=10):
        self.file = filehandle
        self.terminator = terminator
        self.chunksize = chunksize
        self.chunk = ''
        self.lines = []
        self.done = False
    def __iter__(self):
        return self
    def next(self):
        if self.done:
            raise StopIteration
        try:
            return self.lines.pop(0)
        except IndexError:
            #The lines list is empty, so let's read some more!
            while True:
                #Looping so even if our chunksize is smaller than one line we get at least one chunk
                newchunk = self.file.read(self.chunksize)
                self.chunk += newchunk
                rawlines = self.chunk.split(self.terminator)
                if len(rawlines) > 1 or not newchunk:
                    #we want to keep going until we have at least one block
                    #or reached the end of the file
                    break
            self.lines.extend(rawlines[:-1])
            self.chunk = rawlines[-1]
            try:
                return self.lines.pop(0)
            except IndexError:
                #The end of the road, return last remaining stuff
                self.done = True
                return self.chunk

zenfh = open('zen.txt', 'rb')
zenBreaker = SpecialDelimiters(zenfh, '|')
for line in zenBreaker:
    print line
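The same class works unchanged with a multi-character terminator and a bigger chunk size, e.g. (a sketch; 'data.txt' is a made-up file name):
fh = open('data.txt', 'rb')
for line in SpecialDelimiters(fh, ':;:', 65536):
    print line
fh.close()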
Upvotes: 2
Reputation: 27585
Here's a generator function thats acts as an iterator on a file, cuting the lines according exotic newline being identical in all the file.
It reads the file by chunks of lenchunk
characters and displays the lines in each current chunk, chunk after chunk.
Since the newline is 3 characters in my exemple (':;:'), it may happen that a chunk ends with a cut newline: this generator function takes care of this possibility and manages to display the correct lines.
In case of a newline being only one character, the function could be simplified. I wrote only the function for the most delicate case.
Employing this function allows to read a file one line at a time, without reading the entire file into memory.
from random import randrange, choice

# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
               for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)

# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are kept at the end of the returned lines
def liner(filename, eol, lenchunk, nl=0):
    # nl = 0 or 1 acts like keepends in splitlines()
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            chunk = tail + chunk          # reattach what was left over from the previous chunk
            last = chunk.rfind(eol)       # position of the last complete newline, if any
            if last==-1:
                kept = ''                 # no complete line in this chunk yet
                tail = chunk              # keep everything for the next round
            else:
                kept = chunk[0:last+L]    # here: L
                tail = chunk[last+L:]     # here: L
            x = y = 0
            while y+1:
                y = kept.find(eol,x)
                if y+1: yield kept[x:y+NL]   # here: NL
                else: break
                x = y+L                      # here: L
            chunk = f.read(lenchunk)
        yield tail

for line in liner('fofo.txt', ':;:', 65536):
    print line
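A quick way to check the boundary handling (a sketch): compare the generator's output with a plain in-memory split of the whole file, using a chunk size small enough that the 3-character ':;:' terminator regularly gets cut between two chunks:
with open('fofo.txt', 'rb') as f:
    expected = f.read().split(':;:')
assert list(liner('fofo.txt', ':;:', 7)) == expected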
Here's the same code, with some printing here and there so you can follow the algorithm.
from random import randrange, choice

# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
               for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)

# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are kept at the end of the returned lines
def liner(filename, eol, lenchunk, nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        ch = f.read()
        the_end = '\n\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'+\
                  '\nend of the file=='+ch[-50:]+\
                  '\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'
        f.seek(0,0)
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            if (chunk[-1]==':' and chunk[-3:]!=':;:') or chunk[-2:]==':;':
                wr = [' ##########---------- cut newline cut ----------##########'+\
                      '\nchunk== '+chunk+\
                      '\n---------------------------------------------------']
            else:
                wr = ['chunk== '+chunk+\
                      '\n---------------------------------------------------']
            chunk = tail + chunk          # reattach what was left over from the previous chunk
            last = chunk.rfind(eol)       # position of the last complete newline, if any
            if last==-1:
                kept = ''                 # no complete line in this chunk yet
                newtail = chunk           # keep everything for the next round
            else:
                kept = chunk[0:last+L]    # here: L
                newtail = chunk[last+L:]  # here: L
            wr.append('\nkept== '+kept+\
                      '\n---------------------------------------------------'+\
                      '\nnewtail== '+newtail)
            tail = newtail
            wr.append('\n---------------------------------------------------'+\
                      '\ntail + chunk== '+chunk+\
                      '\n---------------------------------------------------')
            print ''.join(wr)
            x = y = 0
            while y+1:
                y = kept.find(eol,x)
                if y+1: yield kept[x:y+NL]   # here: NL
                else: break
                x = y+L                      # here: L
            print '\n\n==================================================='
            chunk = f.read(lenchunk)
        yield tail
        print the_end

for line in liner('fofo.txt', ':;:', 1):
    print 'line== '+line
EDIT
I compared the execution times of my code and of chmullig's code.
With a 'fofo.txt' file of about 10 MB, created with
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,60)))
               for i in xrange(324000))
with open('fofo.txt','wb') as g:
    g.write(ch)
and measuring the times like this:
from time import clock

te = clock()
for line in liner('fofo.txt', ':;:', 65536):
    pass
print clock()-te

fh = open('fofo.txt', 'rb')
zenBreaker = SpecialDelimiters(fh, ':;:', 65536)
te = clock()
for line in zenBreaker:
    pass
print clock()-te
I obtained the following minimum times over several runs:
my code ............ 0.7067 seconds
chmullig's code .... 0.8373 seconds
EDIT 2
I changed my generator function: liner2() takes a file handle instead of the file's name, so the opening of the file can be kept out of the timed section, as it is for the timing of chmullig's code:
def liner2(fh, eol, lenchunk, nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    chunk = fh.read(lenchunk)
    tail = ''
    while chunk:
        chunk = tail + chunk          # reattach what was left over from the previous chunk
        last = chunk.rfind(eol)       # position of the last complete newline, if any
        if last==-1:
            kept = ''                 # no complete line in this chunk yet
            tail = chunk              # keep everything for the next round
        else:
            kept = chunk[0:last+L]    # here: L
            tail = chunk[last+L:]     # here: L
        x = y = 0
        while y+1:
            y = kept.find(eol,x)
            if y+1: yield kept[x:y+NL]   # here: NL
            else: break
            x = y+L                      # here: L
        chunk = fh.read(lenchunk)
    yield tail
fh = open('fofo.txt', 'rb')
te = clock()
for line in liner2(fh, ':;:', 65536):
    pass
print clock()-te
The results, taking the minimum times over numerous runs, are
with liner() ....... 0.7067 seconds
with liner2() ...... 0.7064 seconds
chmullig's code .... 0.8373 seconds
In fact, opening the file accounts for an infinitesimal part of the total time.
Upvotes: 2
Reputation: 12753
TextFileData.split(EndOfLine_char)
seems to be your solution.
If it's not working fast enough, then you should consider using a lower-level programming language.
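In other words, something like this (a sketch; 'myfile.txt' and '|' are placeholders):
# read everything, then split on the custom end-of-line string
with open('myfile.txt', 'rb') as f:
    TextFileData = f.read()
lines = TextFileData.split('|')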
Upvotes: 1
Reputation: 188124
Given your constraints, it might be best to convert the known unusual newlines to normal newlines first and then use the usual:
for line in file:
    ...
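A minimal sketch of that idea (the file names and the ':;:' terminator are placeholders, and it assumes the data itself contains no real '\n' characters): convert the file once, streaming it in chunks, then iterate over the converted copy the normal way.
# rewrite the file once, replacing the custom end-of-line string with '\n'
def convert(src, dst, eol, chunksize=65536):
    tail = ''
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        chunk = fin.read(chunksize)
        while chunk:
            chunk = tail + chunk                  # reattach the leftover of the previous chunk
            last = chunk.rfind(eol)               # last complete terminator, if any
            if last == -1:
                head, tail = '', chunk
            else:
                head, tail = chunk[:last+len(eol)], chunk[last+len(eol):]
            fout.write(head.replace(eol, '\n'))
            chunk = fin.read(chunksize)
        fout.write(tail)                          # no terminator left in the tail

convert('input.txt', 'converted.txt', ':;:')
with open('converted.txt', 'rb') as f:
    for line in f:
        print line.rstrip('\n')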
Upvotes: 1