Tyler Bell

Reputation: 897

Python Beautiful Soup Extracting HTML Meta Data

I am getting some odd behavior that I do not quite understand. I am hoping someone can explain what is going on.

Consider this metadata:

<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">

This line successfully finds ALL "og" properties and returns a list.

opengraphs = doc.html.head.findAll(property=re.compile(r'^og'))

However, this line fails to do the same thing for the twitter cards.

twitterCards = doc.html.head.findAll(name=re.compile(r'^twitter'))

Why does the first line find all the "og" (Open Graph) tags, while the second line fails to find the Twitter cards?

Upvotes: 2

Views: 1405

Answers (2)

furas

Reputation: 142919

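The problem is the name= keyword, which has a special meaning: it is the first parameter of findAll() and matches the tag name, which in your document is meta. No tag name starts with "twitter", so the result is empty.

To filter on the name attribute instead, pass "meta" as the tag name and a dictionary with a "name" key as the attributes.

Example with different items: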

from bs4 import BeautifulSoup
import re

data='''
<meta property="og:title" content="This is the Tesla Semi truck">
<meta property="twitter:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
'''

head = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

print(head.findAll(property=re.compile(r'^og'))) # OK
print(head.findAll(property=re.compile(r'^tw'))) # OK

print(head.findAll(name=re.compile(r'^meta'))) # OK
print(head.findAll(name=re.compile(r'^tw')))   # empty

print(head.findAll('meta', {'name': re.compile(r'^tw')})) # OK
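
If the goal is to pull the values out rather than just find the tags, the dictionary form above can be wrapped in a small helper. This is only a sketch: collect_meta and the sample viewport tag are made up for illustration, and it assumes every matching tag carries a content attribute.

```python
import re
from bs4 import BeautifulSoup

html = '''
<head>
<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
<meta name="viewport" content="width=device-width">
</head>
'''

def collect_meta(soup, attr, prefix):
    """Hypothetical helper: map each matching attribute value to its content."""
    pattern = re.compile('^' + re.escape(prefix))
    # Passing the filter through attrs= keeps 'name' from being read as the tag name.
    tags = soup.find_all('meta', attrs={attr: pattern})
    # tag[attr] reads the HTML attribute; tag.name would be the tag name ('meta').
    return {tag[attr]: tag.get('content') for tag in tags}

soup = BeautifulSoup(html, 'html.parser')
og = collect_meta(soup, 'property', 'og:')
twitter = collect_meta(soup, 'name', 'twitter:')
print(og)
print(twitter)
```

The viewport tag is skipped because its name does not start with the prefix.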

Upvotes: 5

alecxe

Reputation: 474131

This is because name is the keyword for find_all()'s tag name argument, so in this case BeautifulSoup looks for elements whose tag names start with twitter, and there are none.

In order to specify that you actually mean an attribute, use:

doc.html.head.find_all(attrs={'name': re.compile(r'^twitter')})

Or, via a CSS selector:

doc.html.head.select("[name^=twitter]")

where ^= means "starts with".
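
To check the three calls side by side, here is a minimal self-contained sketch. The sample HTML (including the twitter:card tag) is an assumption added for illustration, and 'html.parser' is used only so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

html = '''<html><head>
<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
<meta name="twitter:card" content="summary">
</head></html>'''

doc = BeautifulSoup(html, 'html.parser')

# name= filters on the tag name (html, head, meta), so nothing matches.
by_tag_name = doc.html.head.find_all(name=re.compile(r'^twitter'))

# attrs= filters on the HTML attribute called "name".
by_attrs = doc.html.head.find_all(attrs={'name': re.compile(r'^twitter')})

# The CSS attribute selector expresses the same "starts with" condition.
by_css = doc.html.head.select('[name^="twitter"]')

print(len(by_tag_name))
print([tag['content'] for tag in by_attrs])
print([tag['content'] for tag in by_css])
```

The last two calls should return the same tags in document order, so either form can be used.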

Upvotes: 3
