itsalexlol

Reputation: 11

Unexpected results when trying to get meta data with BeautifulSoup

Alright, here's what I'm trying to do. I'm fairly new to Python and only just getting to grips with it. With this small tool, I'm trying to extract data from a page: the user enters a URL and the tool returns the following tag:

<meta content=" % Likes, % Comments - @% on Instagram: “post description []”" name="description" /> 

Each % should be replaced by the real value for that post (number of likes, number of comments, username).

Here's my full code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.instagram.com/p/BsOGulcndj-/"
page2 = requests.get(url)
soup2 = BeautifulSoup(page2.content, 'html.parser')
result = soup2.findAll('content', attrs={'content': 'description'})
print(result)

But whenever I run it, I'm given []. What am I doing wrong?

Upvotes: 0

Views: 217

Answers (2)

JoshG

Reputation: 6745

This seems to work:

for tag in soup2.findAll("meta"):
    if tag.get("property", None) == "og:description":
        print(tag.get("content", None))

Basically, you're looping over all of the <meta> tags in the page and looking for the ones whose property attribute is "og:description", which seems to be the Open Graph property you want.

Does that help?

The complete version:

from bs4 import BeautifulSoup
import requests

url = "https://www.instagram.com/p/BsOGulcndj-/"
page2 = requests.get(url)
soup2 = BeautifulSoup(page2.content, 'html.parser')

# Print the content of every <meta> tag whose property is "og:description".
for tag in soup2.findAll("meta"):
    if tag.get("property", None) == "og:description":
        print(tag.get("content", None))
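As a minimal sketch of why the original call printed [] and how the loop can be collapsed into a direct lookup: findAll's first argument is the tag name, so 'content' looks for <content> tags, which don't exist. The sample HTML below is a hypothetical stand-in for the downloaded page (the real markup may differ), and find_all is the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><head>
<meta content="10 Likes, 2 Comments - @someone on Instagram: hello" name="description" />
<meta property="og:description" content="10 Likes, 2 Comments - @someone on Instagram: hello" />
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The call from the question: it searches for <content> tags whose content
# attribute equals "description". No such tag exists, so the list is empty.
original = soup.find_all("content", attrs={"content": "description"})

# Filter on the tag name "meta" and on the attribute that identifies the tag.
og_tag = soup.find("meta", attrs={"property": "og:description"})
name_tag = soup.find("meta", attrs={"name": "description"})

# find() returns None when nothing matches, so guard before reading attributes.
og_text = og_tag.get("content") if og_tag else None
name_text = name_tag.get("content") if name_tag else None

print(original)
print(og_text)
print(name_text)
```

Passing the attribute filter to find() avoids looping over every <meta> tag by hand, and the None guard keeps the script from crashing if the page layout changes.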

Update: Regarding your question about pretty printing this, there are several ways that can be accomplished. One of those ways involves regular expressions and string interpolation. For example:

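import re

# `string` is the content attribute of the description tag, e.g. tag.get("content").
likes = re.search(r'(.*?) Likes', string).group(1).strip()
comments = re.search(r'Likes, (.*?) Comments', string).group(1).strip()
description = re.search(r' - (.*)', string, re.DOTALL).group(1)

print(f"{likes} Likes | {comments} Comments | {description}")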

But if you have another question regarding this, it should probably be made in a new post.
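The same idea can be sketched with a single pattern and named groups, which also makes it easy to handle text that doesn't match. The summarize helper and the sample string are hypothetical, and the pattern assumes the "N Likes, M Comments - ..." layout shown in the question; anything that deviates from it returns None.

```python
import re

# Assumes the layout "<likes> Likes, <comments> Comments - <description>".
PATTERN = re.compile(
    r"\s*(?P<likes>\S+) Likes, (?P<comments>\S+) Comments - (?P<description>.*)",
    re.DOTALL,
)

def summarize(content):
    """Return a one-line summary, or None if the text doesn't fit the layout."""
    match = PATTERN.match(content)
    if match is None:
        return None
    return f"{match['likes']} Likes | {match['comments']} Comments | {match['description']}"

# Hypothetical sample value, not real data from the page.
sample = '1,234 Likes, 56 Comments - @someone on Instagram: "hello - world"'
summary = summarize(sample)
print(summary)
print(summarize("no counts here"))
```

Matching the counts with \S+ keeps a comma inside a number such as 1,234 from being mistaken for the separator between likes and comments.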

Upvotes: 0

Barmar

Reputation: 782148

The correct way to match those tags is with:

result = soup2.findAll('meta', content=True, attrs={"name": "description"})

However, html.parser doesn't parse <meta> tags properly. It doesn't realize they're self-closing, so the result includes much of the rest of the <head>. I switched the parser to html5lib (a separate package that has to be installed):

soup2 = BeautifulSoup(page2.content, 'html5lib')

and then the result of the above search was:

[<meta content="46.3m Likes, 2.6m Comments - EGG GANG 🌍 (@world_record_egg) on Instagram: “Let’s set a world record together and get the most liked post on Instagram. Beating the current…”" name="description"/>]
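Different parsers can build different trees from the same markup, so one way to check what each installed parser does with this search is to run it against a small snippet. This is only a sketch: the sample HTML and the description_tags helper are hypothetical, and parsers that aren't installed are reported rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the page's <head>; the real markup may differ.
SAMPLE_HTML = (
    '<html><head>'
    '<meta content="10 Likes, 2 Comments - @someone on Instagram" name="description"/>'
    '<title>Sample post</title>'
    '</head><body><p>body text</p></body></html>'
)

def description_tags(html, parser):
    """Return the matching <meta> tags as strings, or None if the parser is missing."""
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        return None
    tags = soup.find_all("meta", content=True, attrs={"name": "description"})
    return [str(tag) for tag in tags]

results = {p: description_tags(SAMPLE_HTML, p) for p in ("html.parser", "lxml", "html5lib")}

for parser, tags in results.items():
    if tags is None:
        print(f"{parser}: not installed")
    else:
        print(f"{parser}: {len(tags)} match(es)")
        for tag in tags:
            print("   ", tag)
```

If a parser swallows more of the <head> than expected, it shows up immediately in the printed tag.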

Upvotes: 1
