Reputation: 107
I'm trying to extract data from the following URL: https://www.medicineindia.org/pharmacology-for-generic/3/diphtheria-toxoid-pertusis-vaccine-tetanus-toxoid. I need the data to be appended in the following form:
[
['id', 'heading', 'data_under_heading_as_one_string','heading','data_under_heading_as_one_string',....],
['id', 'heading', 'data_under_heading_as_one_string','heading','data_under_heading_as_one_string',....]
]
There are multiple items on the page, and I need the information for each item as a separate list, as shown above. Each item name is given in an h2 tag; the related information falls under 20 headings (dt tags), with the corresponding text in dd tags.
Below is my approach:
final_data = []
for g in range(5):
    url = df['url_column'][g]
    page_source = req.get(url)
    soup = bs4.BeautifulSoup(page_source.text,"html5lib")
    heading = soup.find_all('h2')
    headings = []
    for head in heading:
        headings.append(head.text)
    for i in range(len(headings)-1):
        text = soup.find(text=headings[i])
        row = []
        row.insert(0,df['id'][g])
        for d in range(40):
            for x in text.findNext(['dt','dd']):
                row.append(x) # <--- here's the problem
                text = x
        final_data.append(row)
    print(g, end = ' ')
My problem is that the content under one of the headings (a numbered list of strings) is being broken into several strings instead of one. As a result, when I build a dataframe by appending all the row lists, it ends up with unnecessary columns containing br/ tags etc.
I tried converting x (marked with the comment "here's the problem" in the code), which is a NavigableString, to a string and replacing the unnecessary br/, numbering, periods etc.:
s = str(x) # here's the problem
row.append(s.replace('<dd>|</dd>|<br/>|\d+\.',''))
Any help would be much appreciated!
Upvotes: 1
Views: 277
Reputation: 195543
I hope I understood your question correctly. This script collects all <h2>, <dt> and <dd> tags into a structured list:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://www.medicineindia.org/pharmacology-for-generic/3/diphtheria-toxoid-pertusis-vaccine-tetanus-toxoid'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for tag in soup.select('h2, dt, dd'):
    if tag.name == 'h2':
        # each <h2> starts a new item (a new inner list)
        all_data.append([tag.get_text()])
    elif tag.name in ('dt', 'dd'):
        # separator=' ' joins text split by <br/> into a single string
        all_data[-1].append(tag.get_text(strip=True, separator=' '))

pprint(all_data, width=150)
Prints:
[['Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'About Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Mechanism of Action of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Pharmacokinets of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Onset of Action for Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Duration of Action for Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Half Life of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Side Effects of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Contra-indications of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Special Precautions while taking Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
'N/A',
'Pregnancy Related Information',
'N/A',
'Old Age Related Information',
'N/A',
'Breast Feeding Related Information',
...and so on.
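If you also want the row layout from the question (id first, then heading/text pairs) with the list numbering removed, something along these lines might work. Note that str.replace only matches literal text, so a pattern such as '\d+\.' needs re.sub instead. This is only a sketch: sample_rows, clean_text and to_output_row are hypothetical names, the sample data is made up to resemble the lists above, and the numbering pattern assumes items look like "1." or "12." and are not part of a decimal such as "2.5".

```python
import re

# Hypothetical sample shaped like the rows in all_data above:
# item name first, then alternating heading / text entries.
sample_rows = [
    ["Drug A",
     "Side Effects of Drug A", "1. Fever 2. Swelling 3. Rash",
     "Half Life of Drug A", "N/A"],
]

# List numbering such as "1." or "12." at the start of the text or after
# whitespace; the (?!\d) lookahead leaves decimals like "2.5" untouched.
NUMBERING = re.compile(r"(?:^|(?<=\s))\d+\.(?!\d)\s*")


def clean_text(text):
    """Remove list numbering and collapse runs of whitespace."""
    without_numbers = NUMBERING.sub("", text)
    return " ".join(without_numbers.split())


def to_output_row(item_id, row):
    """Return [id, heading, text, heading, text, ...], dropping the item name."""
    pairs = row[1:]
    # Even positions are headings (kept as-is), odd positions are texts (cleaned).
    cleaned = [clean_text(v) if i % 2 else v for i, v in enumerate(pairs)]
    return [item_id] + cleaned


final_data = [to_output_row(3, row) for row in sample_rows]
print(final_data)
```

In the loop from the question, the id would come from df['id'][g] rather than the hard-coded 3 used here.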
Upvotes: 2