Reputation: 2790
I have the following variable, header, equal to:
<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>
I want to extract only the date, February 11, 2017, from this variable.
How can I do it using BeautifulSoup in Python?
Upvotes: 3
Views: 5133
Reputation: 240858
If you know that the date is always the last text node in the header variable, then you could access the .contents property and get the last element in the returned list:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')
header.contents[-1].strip()
> February 11, 2017
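The snippet above assumes that a variable named html already holds the markup. As a minimal self-contained sketch, the version below defines html from the markup in the question and adds a check (not part of the original answer) that the last child really is a bare text node before stripping it:

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question; `html` is the name the snippets assume.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

last = header.contents[-1]
# Only treat the last child as the date if it is a text node rather than a tag.
date = last.strip() if isinstance(last, NavigableString) else None
print(date)
```

If the paragraph ended with an element instead of text, date would be None rather than the result of calling .strip() on a tag.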
Or, as MYGz pointed out in the comments below, you could split the text at newlines and retrieve the last element in the list. This relies on the markup having a line break after each <br/>, as it does here:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')
header.text.split('\n')[-1]
> February 11, 2017
If you don't know the position of the date text node, then another option would be to parse out any matching strings:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')
re.findall(r'\w+ \d{1,2}, \d{4}', header.text)[0]
> February 11, 2017
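Note that indexing the result of re.findall with [0] raises an IndexError when nothing matches. One possible way to guard against that, sketched here with a hypothetical helper named extract_date, is to use re.search and return None on a miss. The sample string is assumed to be what header.text yields for the markup in the question:

```python
import re

# Same pattern as in the snippet above, compiled once for reuse.
DATE_PATTERN = re.compile(r'\w+ \d{1,2}, \d{4}')

def extract_date(text):
    """Return the first 'Month D, YYYY' substring in text, or None if absent."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None

# Literal stand-in for header.text, based on the markup in the question.
sample = 'Andrew Anglin\nDaily Stormer\nFebruary 11, 2017'
print(extract_date(sample))
print(extract_date('no date here'))
```

With a real page you would call extract_date(header.text) and handle the None case however suits your scraper.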
However, as your title implies, if you only want to retrieve text nodes that aren't wrapped in an element tag, then you could use the following, which filters out elements and whitespace-only strings:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')
text_nodes = [e.strip() for e in header if not e.name and e.strip()]
Keep in mind that this returns the following, since the first text node isn't wrapped in an element either:
> ['Andrew Anglin', 'February 11, 2017']
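An alternative that should give the same result is to let BeautifulSoup do the filtering with find_all(string=True, recursive=False), which looks only at direct children and matches only text nodes. This is a sketch that assumes a bs4 version recent enough to accept the string argument (older releases call it text); the html variable is defined from the markup in the question:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# recursive=False restricts the search to direct children of <p>;
# string=True matches text nodes only, so <br/> and <strong> are skipped.
direct_text = header.find_all(string=True, recursive=False)

# Drop whitespace-only nodes such as the newline between <br/> and <strong>.
text_nodes = [s.strip() for s in direct_text if s.strip()]
print(text_nodes)
```

The text inside the strong element is excluded because it is a grandchild of the paragraph, not a direct child.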
Of course you could also combine the last two options and parse out the date strings in the returned text nodes:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')
for node in header:
    if not node.name and node.strip():
        match = re.findall(r'^\w+ \d{1,2}, \d{4}$', node.strip())
        if match:
            print(match[0])
> February 11, 2017
Upvotes: 5