Reputation: 3191
I'd like to read to a dictionary all of the lines in a text file that come after a particular string. I'd like to do this over thousands of text files.
I'm able to identify and print out the particular string ('Abstract'
) using the following code (gotten from this answer):
for files in filepath:
with open(files, 'r') as f:
for line in f:
if 'Abstract' in line:
print line;
But how do I tell Python to start reading the lines that only come after the string?
Upvotes: 21
Views: 71374
Reputation: 180401
Just start another loop when you reach the line you want to start from:
for files in filepath:
with open(files, 'r') as f:
for line in f:
if 'Abstract' in line:
for line in f: # now you are at the lines you want
# do work
A file object is its own iterator, so when we reach the line with 'Abstract'
in it we continue our iteration from that line until we have consumed the iterator.
A simple example:
gen = (n for n in xrange(8))
for x in gen:
if x == 3:
print('Starting second loop')
for x in gen:
print('In second loop', x)
else:
print('In first loop', x)
Produces:
In first loop 0
In first loop 1
In first loop 2
Starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7
You can also use itertools.dropwhile to consume the lines up to the point you want:
from itertools import dropwhile
for files in filepath:
with open(files, 'r') as f:
dropped = dropwhile(lambda _line: 'Abstract' not in _line, f)
next(dropped, '')
for line in dropped:
print(line)
Upvotes: 31
Reputation: 3954
To me, the following code is easier to understand.
with open(file_name, 'r') as f:
while not 'Abstract' in next(f):
pass
for line in f:
#line will be now the next line after the one that contains 'Abstract'
Upvotes: 8
Reputation: 279255
Making a guess as to how the dictionary is involved, I'd write it this way:
lines = dict()
for filename in filepath:
with open(filename, 'r') as f:
for line in f:
if 'Abstract' in line:
break
lines[filename] = tuple(f)
So for each file, your dictionary contains a tuple of lines.
This works because the loop reads up to and including the line you identify, leaving the remaining lines in the file ready to be read from f
.
Upvotes: 1
Reputation: 142146
You can use itertools.dropwhile
and itertools.islice
here, a pseudo-example:
from itertools import dropwhile, islice
for fname in filepaths:
with open(fname) as fin:
start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
for line in islice(start_at, 1, None): # ignore the line still with Abstract in
print line
Upvotes: 8
Reputation: 17168
Just to clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can just set a boolean flag to indicate whether or not lines should be ignored, and check it at each line.
pay_attention = False
for line in f:
if pay_attention:
print line
else: # We haven't found our trigger yet; see if it's in this line
if 'Abstract' in line:
pay_attention = True
If you don't mind a little more rearranging of your code, you can also use two partial loops instead: one loop that terminates once you've found your trigger phrase ('Abstract'
), and one that reads all following lines. This approach is a little cleaner (and a very tiny bit faster).
for skippable_line in f: # First skim over all lines until we find 'Abstract'.
if 'Abstract' in skippable_line:
break
for line in f: # The file's iterator starts up again right where we left it.
print line
The reason this works is that the file object returned by open
behaves like a generator, rather than, say, a list: it only produces values as they are requested. So when the first loop stops, the file is left with its internal position set at the beginning of the first "unread" line. This means that when you enter the second loop, the first line you see is the first line after the one that triggered the break
.
Upvotes: 4
Reputation: 5156
Use a boolean to ignore lines up to that point:
found_abstract = False
for files in filepath:
with open(files, 'r') as f:
for line in f:
if 'Abstract' in line:
found_abstract = True
if found_abstract:
#do whatever you want
Upvotes: 9