tyzion

Reputation: 270

How do I scrape the text in the Introduction section?

I am trying to scrape the text under the Introduction section, but I get "[]" as the output.

The code is:

import requests
import bs4
import lxml
import html5lib
from bs4 import BeautifulSoup
import re

result=requests.get("https://www.1mg.com/drugs/augmentin-625-duo-tablet-138629")

soup = bs4.BeautifulSoup(result.text,"lxml")

intro=soup.find(text=re.compile('Introduction')).parent.parent.find_all('div', attrs={"class": "DrugOverview__content___22ZBX"})
print(intro)

I am writing the code in the Sublime Text editor and running it in Git Bash.

PS: Please include an explanation of how to resolve this, because I'm new to web scraping and haven't got the hang of it yet. Thanks.

Upvotes: 1

Views: 68

Answers (1)

furas

Reputation: 142641

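I found that this page checks the User-Agent header. Maybe it generates different HTML for different devices (phone, tablet, laptop), so without a browser-like User-Agent the response does not contain the div you are searching for, and find_all() returns an empty list.

A bare Mozilla/5.0 is not enough. It has to be the full User-Agent string from a real web browser.

You can see your own User-Agent at https://httpbin.org/get. This page is useful for testing what a script sends to the server.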
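To see the difference without sending anything over the network, here is a minimal sketch that builds the request locally and prints the User-Agent it would carry. It assumes only that requests is installed. The names default_ua, browser_ua, plain and custom are just illustrative.

```python
import requests

# The User-Agent requests sends when you do not set one yourself.
# It has the form "python-requests/<version>".
default_ua = requests.utils.default_user_agent()

# A full browser User-Agent string (the one used in this answer).
browser_ua = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) "
    "Gecko/20100101 Firefox/84.0"
)

url = "https://www.1mg.com/drugs/augmentin-625-duo-tablet-138629"
session = requests.Session()

# prepare_request() merges the session defaults with the request headers
# and builds the request object, but it does not send it.
plain = session.prepare_request(requests.Request("GET", url))
custom = session.prepare_request(
    requests.Request("GET", url, headers={"User-Agent": browser_ua})
)

print("without headers:", plain.headers["User-Agent"])
print("with headers:   ", custom.headers["User-Agent"])
```

The first line printed is what the code in the question sends, which a server can easily tell apart from a real browser.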

This code works for me.

import requests
from bs4 import BeautifulSoup

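# Send a full browser User-Agent; the default python-requests one gets different HTML.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
}

url = "https://www.1mg.com/drugs/augmentin-625-duo-tablet-138629"

result = requests.get(url, headers=headers)

soup = BeautifulSoup(result.text, "lxml")

# Find the "Introduction" heading text, go up two levels to its container,
# then take the content div inside it.
# string= is the current name for the older text= argument.
intro = soup.find(string='Introduction').parent.parent.find('div', {"class": "DrugOverview__content___22ZBX"})

# get_text() returns only the text, without the HTML tags.
text = intro.get_text(strip=True)

print(text)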
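If the site changes its HTML or blocks the request, soup.find() returns None and the chained .parent call raises AttributeError. Below is a rough sketch of a more defensive version. The helper extract_intro and the SAMPLE_HTML snippet are made up for illustration (the real page is more complex), and it uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page structure.
SAMPLE_HTML = """
<div class="container">
  <h2>Introduction</h2>
  <div class="DrugOverview__content___22ZBX">Placeholder introduction text.</div>
</div>
"""


def extract_intro(html, content_class="DrugOverview__content___22ZBX"):
    """Return the introduction text, or None if the expected tags are missing."""
    soup = BeautifulSoup(html, "html.parser")

    # Look for a text node that is exactly "Introduction".
    heading = soup.find(string="Introduction")
    if heading is None:
        return None

    # Go up two levels: text node -> heading tag -> surrounding container.
    container = heading.parent.parent
    if container is None:
        return None

    content = container.find("div", class_=content_class)
    if content is None:
        return None

    return content.get_text(strip=True)


print(extract_intro(SAMPLE_HTML))
print(extract_intro("<html><body>Access denied</body></html>"))
```

With the real page you would pass result.text instead of SAMPLE_HTML. A None result then tells you the server did not return the HTML you expected, which is a good moment to check the headers again.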

Upvotes: 2
