Reputation: 1395
I am trying to parse out the contents of two different tags in a txt file and I am getting all the instances of the first tag "p" but not the second "l". Is the problem with the "or"?
Thanks for the help. Here is the code I am using:
with open('standardA00456.txt','w') as output_file:
    with open('standardA00456.txt','r') as open_file:
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            start_position = the_whole_file.find('<p>' or '<l>', start_position)
            end_position = the_whole_file.find('</p>' or '</l>', start_position)
            data = the_whole_file[start_position:end_position+5]
            output_file.write(data + "\n")
            start_position = end_position
Upvotes: 1
Views: 1660
Reputation: 7671
'<p>' or '<l>' will always equal '<p>', as it tells Python to use '<l>' only if '<p>' is None, False, numeric zero, or empty. And as the string '<p>' is never one of those, '<l>' is always skipped:
>>> '<p>' or '<l>'
'<p>'
>>> None or '<l>'
'<l>'
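To see what that means for the find call in the question, here is a small sketch. The sample string is made up for illustration and is not from the asker's file.

```python
# Hypothetical sample text, not taken from the asker's file.
sample = "<l>first line</l><p>a paragraph</p>"

# The argument is evaluated before find() runs, so find() only ever sees '<p>'.
with_or = sample.find('<p>' or '<l>', 0)
only_p = sample.find('<p>', 0)

# Searching for each tag separately and keeping the smallest hit that is not -1
# gives the position of whichever tag opens first.
hits = [pos for pos in (sample.find('<p>'), sample.find('<l>')) if pos != -1]
first_tag = min(hits) if hits else -1

print(with_or, only_p, first_tag)  # prints: 17 17 0
```

Both find calls skip the '<l>' at index 0, which matches the behaviour described in the question.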
Instead you can easily use re.findall:
import re

# Read first: opening the same path with 'w' truncates it immediately.
with open('standardA00456.txt', 'r') as open_f:
    text = open_f.read()

# re.DOTALL lets '.' match newline characters as well.
p_or_ls = re.findall(r'(?:<p>.*?</p>)|(?:<l>.*?</l>)', text, flags=re.DOTALL)

with open('standardA00456.txt', 'w') as out_f:
    for p_or_l in p_or_ls:
        out_f.write(p_or_l + "\n")
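If you want to try the regex approach without touching any file, something like the following might help. It is only a sketch: extract_elements is a name I made up, the sample string is invented, and it uses a backreference variant of the pattern rather than the exact one above.

```python
import re

# Hypothetical helper name; the tag names default to the two from the question.
def extract_elements(text, tags=("p", "l")):
    """Return every <tag>...</tag> span for the given tag names, in document order."""
    names = "|".join(re.escape(tag) for tag in tags)
    # \1 forces the closing tag to match whichever opening tag was found.
    pattern = re.compile(r"<(%s)>.*?</\1>" % names, flags=re.DOTALL)
    return [match.group(0) for match in pattern.finditer(text)]

# Made-up sample input for illustration.
sample = "<p>one</p> noise <l>two\nlines</l> <p>three</p>"
print(extract_elements(sample))
# prints: ['<p>one</p>', '<l>two\nlines</l>', '<p>three</p>']
```

Like any regex for markup, this assumes the tags have no attributes and that elements of the same name are not nested inside each other.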
However, parsing files with tags (such as HTML and XML) using regex is not a good idea. Using a module such as BeautifulSoup is safer:
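from bs4 import BeautifulSoup

# Read first: opening the same path with 'w' truncates it immediately.
with open('standardA00456.txt', 'r') as open_f:
    soup = BeautifulSoup(open_f.read(), 'html.parser')

with open('standardA00456.txt', 'w') as out_f:
    for p_or_l in soup.find_all(["p", "l"]):
        # find_all returns Tag objects, so convert to str before concatenating.
        out_f.write(str(p_or_l) + "\n")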
Upvotes: 1
Reputation: 739
English Grad, I think you need to improve the logic. I modified your code and came up with this:
# Read first: opening the same path with 'w' would truncate it before it is read.
with open('standardA00456.txt','r') as open_file:
    the_whole_file = open_file.read()

with open('standardA00456.txt','w') as output_file:
    start_position = 0
    found_p = False
    found_l = False
    while True:
        start_pos_p = the_whole_file.find('<p>', start_position)
        start_pos_l = the_whole_file.find('<l>', start_position)
        if start_pos_p > -1 and start_pos_l > -1:
            # Both tags still occur: take whichever opens first.
            if start_pos_p < start_pos_l:
                found_p = True
                start_position = start_pos_p
                found_l = False
            else:
                found_l = True
                start_position = start_pos_l
                found_p = False
        elif start_pos_p > -1:
            found_p = True
            start_position = start_pos_p
            found_l = False
        elif start_pos_l > -1:
            found_l = True
            start_position = start_pos_l
            found_p = False
        else:
            break
        if found_p:
            end_position = the_whole_file.find('</p>', start_position)
        elif found_l:
            end_position = the_whole_file.find('</l>', start_position)
        else:
            break
        if end_position == -1:
            break  # opening tag without a closing tag: stop here
        # Both closing tags are 4 characters long.
        data = the_whole_file[start_position:end_position+4]
        output_file.write(data + "\n")
        start_position = end_position
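The same "take whichever tag opens first" idea can probably be condensed into a small generator. This is only a sketch under my own assumptions: iter_elements is a made-up name, it works on a string rather than a file, and it stops at the first unclosed tag.

```python
def iter_elements(text, tags=("p", "l")):
    """Yield each <tag>...</tag> span in order, using only str.find."""
    position = 0
    while True:
        # Find the next opening tag of each kind and keep the earliest one.
        candidates = [(text.find("<%s>" % tag, position), tag) for tag in tags]
        candidates = [(pos, tag) for pos, tag in candidates if pos != -1]
        if not candidates:
            return
        start, tag = min(candidates)
        closing = "</%s>" % tag
        end = text.find(closing, start)
        if end == -1:
            return  # unclosed tag: stop rather than loop forever
        end += len(closing)
        yield text[start:end]
        position = end

# Made-up sample input for illustration.
sample = "<l>a</l><p>b</p><l>c</l>"
print(list(iter_elements(sample)))
# prints: ['<l>a</l>', '<p>b</p>', '<l>c</l>']
```

Using len(closing) instead of a hard-coded offset keeps the slice correct if tag names of other lengths are added later.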
Upvotes: 0