Reputation: 3600
I am parsing an HTML file and would like to match everything between two sequences of characters: Sent: and the <br> tag.
I have seen several very similar questions and tried all of their methods and none have worked for me, probably because I'm a novice and am doing something very simple incorrectly.
Here's my relevant code:
for filename in os.listdir(path): #capture email year, month, day
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            html = f.read()
        soup = BeautifulSoup(html, 'html.parser')
        a = re.findall(r'Sent:/.+?(?=<br>)/', soup.text)[0]
        #a = re.findall(r'Sent:(.*)', soup.text)[0]
        print(a)
        d = parser.parse(a)
        print("year:", d.year)
        print("month:", d.month)
        print("day:", d.day)
I've also tried these for my regex:
a = re.findall(r'Sent:/^(.*?)<br>/', soup.text)[0]
a = re.findall(r'Sent:/^[^<br>]*/', soup.text)[0]
But I keep getting the error "list index out of range". Even when I remove the [0], I get the error "AttributeError: 'list' object has no attribute 'read'" on the line d = parser.parse(a), with only [] printed as a result of print(a).
Here's the relevant block of HTML:
<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br><b>To:</b> David Leveille<br><b>Subject:</b>
Upvotes: 0
Views: 207
Reputation: 2830
The problem is not really your regex, but the fact that BeautifulSoup parses the HTML (its job, after all) and changes its content. For example, your <br> will be transformed to <br/>. Another point: soup.text strips all the tags, so a regex that looks for <br> can no longer match there.
It becomes clearer if you try this script:
from bs4 import BeautifulSoup
import re
from dateutil import parser

# Match against the serialized soup, where <br> has become <br/>
pattern = re.compile(r'Sent:(.+?)(?=<br/>)')
with open("myfile.html", 'r') as f:
    html = f.read()
print("html: ", html)
soup = BeautifulSoup(html, 'lxml')
print("soup.text: ", soup.text)    # tags stripped
print("str(soup): ", str(soup))    # tags kept, <br> rewritten as <br/>
a = pattern.findall(str(soup))[0]
print("pattern extraction: ", a)
For the second part: since your extracted date string is not clean (because of the leading </b>), you should add the parameter fuzzy=True, as explained in the documentation of dateutil.
d = parser.parse(a, fuzzy=True)
print("year:", d.year)
print("month:", d.month)
print("day:", d.day)
Another solution would be to use a more precise regex, so that the captured text no longer includes the closing </b> tag. For example:
pattern = re.compile(r'Sent:</b>(.+?)(?=<br/>)')
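To see the more precise pattern in action without any third-party packages, here is a minimal sketch. It assumes the markup has already been serialized the way str(soup) would produce it (with <br/>), and it swaps dateutil for the standard library's datetime.strptime with a fixed format string, which is my assumption about the date layout and not something dateutil requires. The names serialized, sent_text and sent_date are only illustrative.

```python
import re
from datetime import datetime

# Markup in the form str(soup) would give (note <br/> instead of <br>).
serialized = (
    "<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br/>"
    "<b>To:</b> David Leveille<br/><b>Subject:</b>"
)

# Anchor on the closing </b> so the capture starts at the date itself.
pattern = re.compile(r"Sent:</b>(.+?)(?=<br/>)")

match = pattern.search(serialized)
sent_text = match.group(1).strip() if match else None

# Assumed format: weekday, month name, day, year, 12-hour time with AM/PM.
sent_date = (
    datetime.strptime(sent_text, "%A, %B %d, %Y %I:%M %p")
    if sent_text
    else None
)

print("captured:", sent_text)
if sent_date is not None:
    print("year:", sent_date.year)
    print("month:", sent_date.month)
    print("day:", sent_date.day)
```

With the capture cleaned up like this, fuzzy parsing should no longer be needed, although dateutil remains the more forgiving choice if the date format varies between emails.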
Upvotes: 1
Reputation: 922
Can you please replace your regex with the one below that looks for the key terms and then anything between them and tell me what error if any you are now receiving?
a=re.findall(r"Sent:(.*?)<br>", soup.text)[0]
Upvotes: 1
Reputation: 25789
You don't need the enclosing slashes; Python's re treats them as literal characters to match:
a = re.findall(r"Sent:(.*?)<br>", soup.text)[0]
That being said, you should probably check for the output (or at least use try/except) before trying to get a value from it.
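A rough sketch of that check, assuming you run the pattern on the raw HTML string (where the <br> tags are still present) and not on soup.text. The helper name extract_sent and the sample strings are made up for illustration.

```python
import re

def extract_sent(markup):
    """Return the text between 'Sent:' and the next '<br>', or None if absent."""
    found = re.findall(r"Sent:(.*?)<br>", markup)
    if not found:
        # Nothing matched: report it instead of raising IndexError on found[0].
        return None
    return found[0]

raw = (
    "<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br>"
    "<b>To:</b> David Leveille<br><b>Subject:</b>"
)

# The original pattern with slashes looks for a literal '/' after 'Sent:'.
with_slashes = re.findall(r"Sent:/.+?(?=<br>)/", raw)

print("with slashes:", with_slashes)
print("without slashes:", extract_sent(raw))
print("no header:", extract_sent("no header in this text"))
```

The capture still starts with the closing </b> tag, so you would either strip it afterwards or include it in the pattern before handing the string to a date parser.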
Upvotes: 1
Reputation: 1565
Try this. It also takes into consideration if the <br> tag contains a slash.
/Sent:(.*?)<\/*br>/
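In Python the surrounding slashes would be dropped, and since a self-closing tag puts the slash after "br" (as in <br/>), a variant along these lines may be closer to what is needed. This is only a sketch under that assumption; the pattern and the names tolerant and variants are mine, not part of the answer above.

```python
import re

# Accepts <br>, <br/> and <br /> as the closing delimiter.
tolerant = re.compile(r"Sent:(.*?)<br\s*/?>")

prefix = "<b>Sent:</b> Friday, June 14, 2013 12:07 PM"
variants = [prefix + tag for tag in ("<br>", "<br/>", "<br />")]

captures = []
for markup in variants:
    match = tolerant.search(markup)
    captures.append(match.group(1) if match else None)

for markup, captured in zip(variants, captures):
    print(repr(markup[-6:]), "->", repr(captured))
```

All three forms should yield the same capture, which still begins with the closing </b> tag because the pattern starts right after "Sent:".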
Upvotes: 1