Reputation: 1263
I am trying to use Python and Beautiful Soup to extract the content attribute of the tags below:
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
BeautifulSoup loads the page just fine and finds other things (the code below also grabs the article id from the id attribute hidden in the source), but I don't know the correct way to search the HTML for these bits. I've tried variations of find and findAll to no avail. The code iterates over a list of URLs at present...
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#importing the libraries
from urllib import urlopen
from bs4 import BeautifulSoup
def get_data(page_no):
    webpage = urlopen('http://superfunevents.com/?p=' + str(i)).read()
    soup = BeautifulSoup(webpage, "lxml")
    for tag in soup.find_all("article") :
        id = tag.get('id')
        print id
    # the hard part that doesn't work - I know this example is well off the mark!
    title = soup.find("og:title", "content")
    print (title.get_text())
    url = soup.find("og:url", "content")
    print (url.get_text())
    # end of problem

for i in range (1,100):
    get_data(i)
If anyone can help me sort out the bit that finds the og:title and og:url content, that'd be fantastic!
Upvotes: 71
Views: 106228
Reputation: 473833
Provide the meta tag name as the first argument to find(). Then use keyword arguments to check the specific attributes:
title = soup.find("meta", property="og:title")
url = soup.find("meta", property="og:url")
print(title["content"] if title else "No meta title given")
print(url["content"] if url else "No meta url given")
The if/else checks here are optional if you know that the title and url meta properties will always be present.
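As a minimal, self-contained sketch of the lookup above, the snippet below parses the two tags from the question out of an inline string instead of fetching a page. The helper name meta_content, the og:image property, and the use of the built-in "html.parser" (so lxml is not required) are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question; a real page would come from a download.
html = """
<html><head>
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
</head><body></body></html>
"""

# "html.parser" ships with Python; "lxml" would work the same way here.
soup = BeautifulSoup(html, "html.parser")


def meta_content(soup, prop, default=None):
    """Hypothetical helper: content of <meta property=prop>, or default."""
    tag = soup.find("meta", property=prop)
    # find() returns None when no tag matches; .get() covers a tag
    # that exists but has no content attribute.
    return tag.get("content", default) if tag else default


print(meta_content(soup, "og:title"))
print(meta_content(soup, "og:url"))
# og:image is not in the sample markup, so the default is returned.
print(meta_content(soup, "og:image", "No meta image given"))
```

Wrapping the check in one function keeps the None handling in a single place, which helps when the same lookup runs over many pages.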
Upvotes: 108
Reputation: 461
This code from Jinesh Narayanan (https://gist.github.com/jineshpaloor/6478011) is relevant to this discussion. It prints the page title and the content of the description and keywords meta tags:
from bs4 import BeautifulSoup
import requests
def main():
    r = requests.get('http://www.sourcebits.com/')
    soup = BeautifulSoup(r.content, features="lxml")
    title = soup.title.string
    print ('TITLE IS :', title)
    meta = soup.find_all('meta')
    for tag in meta:
        if 'name' in tag.attrs.keys() and tag.attrs['name'].strip().lower() in ['description', 'keywords']:
            # print ('NAME :',tag.attrs['name'].lower())
            print ('CONTENT :',tag.attrs['content'])

if __name__ == '__main__':
    main()
Upvotes: 1
Reputation: 10538
You could grab the content inside the meta tag with gazpacho:
from gazpacho import Soup
html = """\
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
"""
soup = Soup(html)
soup.find("meta", {"property": "og:title"}).attrs['content']
Which would output:
'Super Fun Event 1'
Upvotes: 1
Reputation: 12018
A way I like to solve this is to pass the attributes as a dictionary:
(This is neater when you have a list of properties to look up...)
title = soup.find("meta", {"property":"og:title"})
url = soup.find("meta", {"property":"og:url"})
# Using same method as above answer
title = title["content"] if title else None
url = url["content"] if url else None
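To illustrate the remark about lists of properties, here is a rough sketch that runs the same dictionary-style lookup in a loop and collects the results in a dict. The function name lookup_properties, the wanted list, and the inline sample HTML are assumptions for the example; og:image is deliberately missing to show the None case.

```python
from bs4 import BeautifulSoup

# Inline sample so the example runs without a network request.
html = """
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
"""

soup = BeautifulSoup(html, "html.parser")

# Properties to look up; og:image does not exist in the sample.
wanted = ["og:title", "og:url", "og:image"]


def lookup_properties(soup, properties):
    """Hypothetical helper: map each property name to its content or None."""
    result = {}
    for prop in properties:
        tag = soup.find("meta", {"property": prop})
        result[prop] = tag.get("content") if tag else None
    return result


og = lookup_properties(soup, wanted)
for name, value in og.items():
    print(name, "->", value)
```

Because the attribute filter is an ordinary dict, the property name can be a variable, which is what makes the loop straightforward.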
Upvotes: 7
Reputation: 19733
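Try this, which loops over every meta tag and checks its property attribute:
soup = BeautifulSoup(webpage, "lxml")
for tag in soup.find_all("meta"):
    # tag.get() returns None when the attribute is missing
    if tag.get("property", None) == "og:title":
        print(tag.get("content", None))
    elif tag.get("property", None) == "og:url":
        print(tag.get("content", None))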
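The same single pass over the meta tags can be generalized to collect every Open Graph property at once rather than testing for each one. This is only a sketch: the function name collect_og, the "og:" prefix test, and the sample markup are assumptions added for illustration.

```python
from bs4 import BeautifulSoup


def collect_og(html):
    """Hypothetical helper: dict of all og:* meta properties in one pass."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for tag in soup.find_all("meta"):
        # Tags without a property attribute yield "" and are skipped.
        prop = tag.get("property", "")
        if prop.startswith("og:"):
            # setdefault keeps the first occurrence if a property repeats.
            found.setdefault(prop, tag.get("content"))
    return found


sample = """
<meta name="description" content="Not an Open Graph tag" />
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
"""

og_tags = collect_og(sample)
print(og_tags)
```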
Upvotes: 33