Reputation: 270
I am trying to scrape the text under the "Introduction" section of the page, but I get "[]" as the output.
The code is:
import requests
import bs4
import lxml
import html5lib
from bs4 import BeautifulSoup
import re
result=requests.get("https://www.1mg.com/drugs/augmentin-625-duo-tablet-138629")
soup = bs4.BeautifulSoup(result.text,"lxml")
intro=soup.find(text=re.compile('Introduction')).parent.parent.find_all('div', attrs={"class": "DrugOverview__content___22ZBX"})
print(intro)
I am writing the code in the Sublime Text editor and running it in Git Bash.
PS: Please explain how to resolve this, because I am new to web scraping and can't seem to get the hang of it yet. Thanks.
Upvotes: 1
Views: 68
Reputation: 142641
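I found that this page checks the User-Agent header. It may generate different HTML for different devices (phone, tablet, laptop), so without a browser User-Agent the HTML you receive apparently does not contain the div you search for, and find_all() returns [].
A bare Mozilla/5.0 is not enough. It has to be a full User-Agent string from a real web browser.
You can see the User-Agent your script sends at https://httpbin.org/get, which is useful for testing what a script sends to the server.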
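To see the difference without sending anything, here is a minimal sketch that prepares a request locally and reads back the User-Agent header requests would attach. The helper name sent_user_agent is hypothetical, used only for illustration; the httpbin URL is just a placeholder target and is never contacted.

```python
import requests


def sent_user_agent(headers=None):
    """Return the User-Agent requests would send (hypothetical helper).

    The request is only prepared, never sent, so no network access happens.
    """
    session = requests.Session()
    request = requests.Request("GET", "https://httpbin.org/get", headers=headers)
    # prepare_request() merges the session's default headers with ours
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


# Default header: identifies the script as python-requests, not a browser
default_ua = sent_user_agent()

# Full browser-like header, as used in the answer
browser_ua = sent_user_agent({
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
})

print(default_ua)
print(browser_ua)
```

The first line printed starts with python-requests/, which is what the site sees when no headers are passed.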
This code works for me.
import requests
from bs4 import BeautifulSoup
# Full User-Agent string copied from a real browser
headers = {
"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
}
url = "https://www.1mg.com/drugs/augmentin-625-duo-tablet-138629"
result = requests.get(url, headers=headers)
soup = BeautifulSoup(result.text, "lxml")
# Find the "Introduction" heading text, go up two levels to the section, then find the content div
intro = soup.find(text='Introduction').parent.parent.find('div', {"class": "DrugOverview__content___22ZBX"})
# get_text() returns only the text, without the HTML tags
text = intro.get_text(strip=True)
print(text)
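If the server sends different HTML again, the chained calls above fail with an AttributeError on None. A more defensive sketch is below. It runs on a small made-up HTML sample (SAMPLE_HTML, not the real page), and extract_intro is a hypothetical helper name. It also assumes the suffix in DrugOverview__content___22ZBX is auto-generated and may change, so it matches only the stable-looking prefix of the class name.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the assumed structure of the page
SAMPLE_HTML = """
<div id="overview">
  <h2>Introduction</h2>
  <div class="DrugOverview__content___22ZBX">Sample introduction text.</div>
</div>
"""


def extract_intro(html):
    """Return the introduction text, or None if the expected markup is missing."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(string="Introduction")
    if heading is None:
        # Heading not present: probably a different page variant
        return None
    # heading.parent is the heading tag, its parent is the enclosing section
    section = heading.parent.parent
    # Match on part of the class name (assumption: the suffix can change)
    content = section.select_one('div[class*="DrugOverview__content"]')
    if content is None:
        return None
    return content.get_text(strip=True)


print(extract_intro(SAMPLE_HTML))
print(extract_intro("<p>blocked</p>"))
```

With the real page you would pass result.text instead of SAMPLE_HTML. A None result is a hint to check the headers or the HTML you actually received.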
Upvotes: 2