Reputation: 11
Alright, here's what I'm trying to do. I'm fairly new to Python and only just getting to grips with it. With this small tool, I'm trying to extract data from a page. In this instance, I want the user to enter a URL and have the tool return:
<meta content=" % Likes, % Comments - @% on Instagram: “post description []”" name="description" />
However, each % should be replaced with the number of likes, comments, etc. that the post has.
Here's my full code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re
url = "https://www.instagram.com/p/BsOGulcndj-/"
page2 = requests.get(url)
soup2 = BeautifulSoup(page2.content, 'html.parser')
result = soup2.findAll('content', attrs={'content': 'description'})
print (result)
But whenever I run it, I get []. What am I doing wrong?
Upvotes: 0
Views: 217
Reputation: 6745
This seems to work:
for tag in soup2.findAll("meta"):
    if tag.get("property", None) == "og:description":
        print(tag.get("content", None))
Basically, you're looping over all of the tags in the page and looking for ones where the property is "og:description", which seems to be the Open Graph property you want.
Does that help?
The complete version:
from bs4 import BeautifulSoup
import requests
url = "https://www.instagram.com/p/BsOGulcndj-/"
page2 = requests.get(url)
soup2 = BeautifulSoup(page2.content, 'html.parser')
# Check every <meta> tag and print the content of the Open Graph description
for tag in soup2.findAll("meta"):
    if tag.get("property", None) == "og:description":
        print(tag.get("content", None))
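As a possible shortcut to the loop above, BeautifulSoup can also filter on the attribute directly. The sketch below runs against a small hard-coded snippet rather than the live page: `sample_html` and the numbers in it are made up for illustration, so it can be tried without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's <head>; a real page contains far more tags.
sample_html = """
<html><head>
<meta property="og:title" content="Example post" />
<meta property="og:description" content="12 Likes, 3 Comments - Example (@example) on Instagram" />
</head><body></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keyword arguments that are not reserved by find() are treated as attribute filters.
tag = soup.find("meta", property="og:description")

# find() returns None when nothing matches, so guard before reading the attribute.
og_description = tag.get("content") if tag is not None else None
print(og_description)
```

The `None` guard matters if the page layout changes or the request returns something other than the post page.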
Update: Regarding your question about pretty printing this, there are several ways that can be accomplished. One of those ways involves regular expressions and string interpolation. For example:
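import re
# `string` is assumed to hold the og:description text printed by the loop above
likes = re.search(r'(.*)Likes', string).group(1).strip()
comments = re.search(r',(.*)Comments', string).group(1).strip()
description = re.search(r'-(.*)', string).group(1).strip()
print(f"{likes} Likes | {comments} Comments | {description}")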
But if you have another question regarding this, it should probably be made in a new post.
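One way to make the regular-expression approach a little more defensive is to use a single anchored pattern with named groups and to handle the no-match case. This is only a sketch: the `sample` text is invented, `parse_description` is a hypothetical helper name, and the pattern assumes the text always reads "N Likes, M Comments - ...", which may not hold for every post.

```python
import re

# Made-up sample in the same shape as the description text; not real data.
sample = "12 Likes, 3 Comments - Example (@example) on Instagram: “Hello - world”"

# One anchored pattern: each count is the run of non-space characters before
# its label, and everything after the first " - " is kept as the description.
PATTERN = re.compile(
    r"^(?P<likes>\S+) Likes, (?P<comments>\S+) Comments - (?P<description>.*)$",
    re.DOTALL,
)


def parse_description(text):
    """Return a dict with likes, comments and description, or None if the text does not match."""
    match = PATTERN.match(text)
    return match.groupdict() if match else None


parsed = parse_description(sample)
if parsed:
    print(f"{parsed['likes']} Likes | {parsed['comments']} Comments | {parsed['description']}")
```

Returning `None` instead of calling `.group(1)` on a failed search avoids an `AttributeError` when the text has an unexpected shape.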
Upvotes: 0
Reputation: 782148
The correct way to match those tags is with:
result = soup2.findAll('meta', content=True, attrs={"name": "description"})
However, html.parser doesn't parse <meta> tags properly. It doesn't realize they're self-closing, so it's including much of the rest of the <head> in the result. I changed to:
soup2 = BeautifulSoup(page2.content, 'html5lib')
and then the result of the above search was:
[<meta content="46.3m Likes, 2.6m Comments - EGG GANG 🌍 (@world_record_egg) on Instagram: “Let’s set a world record together and get the most liked post on Instagram. Beating the current…”" name="description"/>]
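To see why the search in the question came back empty while this one matches, here is a small sketch on a hard-coded snippet. The HTML and its values are invented for illustration, and html.parser is used only so that no extra parser has to be installed; on the real page the parser choice may matter, as described above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with invented values, standing in for the fetched page.
sample_html = '<head><meta content="12 Likes, 3 Comments" name="description" /></head>'
soup = BeautifulSoup(sample_html, "html.parser")

# The question's search asks for <content> tags whose content attribute equals
# "description". No such tag exists, so the result is an empty list.
original = soup.find_all("content", attrs={"content": "description"})

# This search asks for <meta> tags that have name="description" and any content
# attribute. "name" goes through attrs because find_all() already uses the
# keyword `name` for the tag name.
fixed = soup.find_all("meta", content=True, attrs={"name": "description"})

print(original)
print([tag["content"] for tag in fixed])
```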
Upvotes: 1