mel

Reputation: 2790

BeautifulSoup: Extract the text that is not in a given tag

I have a variable, header, that contains the following HTML:

<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>

I want to extract only the date, February 11, 2017, from this variable. How can I do it using BeautifulSoup in Python?

Upvotes: 3

Views: 5133

Answers (1)

Josh Crozier

Reputation: 240858

If you know that the date is always the last text node in the header variable, then you could access the .contents property and get the last element in the returned list:

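from bs4 import BeautifulSoup

# The markup from the question
html = '''<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>'''

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.contents[-1].strip()
> February 11, 2017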

Or, as MYGz pointed out in the comments below, you could split the text at new lines and retrieve the last element in the list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.text.split('\n')[-1]
> February 11, 2017
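Note that splitting on '\n' only works because the source markup happens to have a line break before the date. If the HTML might arrive on a single line, a possible alternative is the .stripped_strings generator, which yields each text fragment with whitespace trimmed. A minimal sketch follows; the helper name last_text_fragment and the minified sample are just for illustration:

```python
from bs4 import BeautifulSoup


def last_text_fragment(markup):
    """Return the last non-empty text fragment inside the first <p>, or None."""
    soup = BeautifulSoup(markup, 'html.parser')
    header = soup.find('p')
    if header is None:
        return None
    # stripped_strings yields every string under the tag with surrounding
    # whitespace removed, skipping strings that are empty after trimming.
    fragments = list(header.stripped_strings)
    return fragments[-1] if fragments else None


# Same content as the question, but on one line (no '\n' to split on).
minified = ('<p>Andrew Anglin<br/><strong>Daily Stormer</strong>'
            '<br/>February 11, 2017</p>')
print(last_text_fragment(minified))
```

This still assumes the date is the last piece of text in the paragraph.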

If you don't know the position of the date text node, then another option would be to parse out any matching strings:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

re.findall(r'\w+ \d{1,2}, \d{4}', header.text)[0]
> February 11, 2017
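One caveat with the regex approach: the pattern also matches strings that merely look like a date, and indexing [0] raises an IndexError when nothing matches. If that matters, you could validate each candidate with datetime.strptime. The sketch below is only one way to do it; the names DATE_PATTERN and extract_dates are made up for the example, and %B assumes English month names in the current locale:

```python
import re
from datetime import datetime

# Word, space, 1-2 digit day, comma, 4 digit year (e.g. "February 11, 2017")
DATE_PATTERN = re.compile(r'[A-Za-z]+ \d{1,2}, \d{4}')


def extract_dates(text):
    """Return a list of datetime.date objects for every valid date in text."""
    dates = []
    for candidate in DATE_PATTERN.findall(text):
        try:
            # %B expects a full month name; anything else raises ValueError
            dates.append(datetime.strptime(candidate, '%B %d, %Y').date())
        except ValueError:
            continue  # looked like a date but is not one
    return dates


# The text that header.text yields for the markup in the question
header_text = 'Andrew Anglin\nDaily Stormer\nFebruary 11, 2017'
print(extract_dates(header_text))
```

An empty list then means no date was found, which is easier to handle than an exception.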

However, as your title implies, if you only want the text nodes that aren't wrapped in a child element, then you could use the following, which filters out the elements:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

text_nodes = [e.strip() for e in header if not e.name and e.strip()]

Keep in mind that this returns the following, since the first text node isn't wrapped either:

> ['Andrew Anglin', 'February 11, 2017']
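If you would rather test the node type explicitly than rely on the name attribute, a variation is to check for NavigableString with isinstance. This is a sketch, not a drop-in replacement; the helper name direct_text_nodes is hypothetical, and it also skips HTML comments because Comment is a subclass of NavigableString:

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def direct_text_nodes(tag):
    """Return the stripped, non-empty text nodes that are direct children of tag."""
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)   # keep strings, drop elements
        and not isinstance(child, Comment)      # comments are strings too
        and child.strip()                       # drop whitespace-only nodes
    ]


html = '''<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>'''

header = BeautifulSoup(html, 'html.parser').find('p')
print(direct_text_nodes(header))
```

The text inside the strong element is left out because it is a child of that element rather than of the paragraph itself.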

Of course, you could also combine the last two options and pick out the date strings from the unwrapped text nodes:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

for node in header:
    if not node.name and node.strip():
        match = re.findall(r'^\w+ \d{1,2}, \d{4}$', node.strip())
        if match:
            print(match[0])

> February 11, 2017

Upvotes: 5
