Reputation: 21
I want to extract the elements that belong to a specific tag. For example, there are four h2 tags on a page, and each h2 has sibling tags such as p, h3, h4, ul and so on. I want to see the elements under h2[1] and the elements under h2[2] separately.
This is what I have done so far. I know the for loop doesn't make sense. I also tried appending the text, but couldn't get it to work. Then I tried searching for a specific string, but that returns only the tag containing that string, not all the other elements.
import requests
from bs4 import BeautifulSoup
page = "https://www.us-cert.gov/ics/advisories/icsma-20-079-01"
resp = requests.get(page)
soup = BeautifulSoup(resp.content, "html5lib")
content_div = soup.find('div', {"class": "content"})
all_p = content_div.find_all('p')
all_h2 = content_div.find_all('h2')
# Attempt: pair the i-th h2 with the i-th p (this does not group a section's content)
i = 0
for h2 in all_h2:
    print(all_h2[i], '\n\n')
    print(all_p[i], '\n')
    i = i + 1
I also tried using append:
tags = soup.find_all('div', {"class": "content"})
container = []
for tag in tags:
    try:
        container.append(tag.text)
        print(tag.text)
    except:
        print(tag)
I am a total newbie in programming, so please pardon my poor coding skills. All I want is to see everything under "mitigation" together, so that if I store it in a DB, everything related to mitigation goes into one column.
Upvotes: 1
Views: 106
Reputation: 45372
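You can look for a static list of tags ["p","ul","h2","div"] using findNext, collecting the text after each h2 until the next h2 or div is reached. Note that passing recursive=False to findNext does not keep it on the top level: recursive is an argument of find and find_all, and findNext simply follows document order. Use find_next_sibling if the walk must stay on the same level: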
import json
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.us-cert.gov/ics/advisories/icsma-20-079-01")
soup = BeautifulSoup(resp.content, "html.parser")
content_div = soup.find('div', {"class": "content"})
h2_list = content_div.find_all("h2")
result = []
search_tags = ["p", "ul", "h2", "div"]

def getChildren(tag):
    # Collect the text of each p/ul after this h2, stopping at the next h2 or div
    text = []
    while tag:
        # find_next (current spelling of findNext) walks forward in document order;
        # recursive is not one of its options, so it is not passed here
        tag = tag.find_next(search_tags)
        if tag is None or tag.name in ("div", "h2"):
            break
        text.append(tag.text.strip())
    # Join with newlines so adjacent paragraphs do not run together
    return "\n".join(text)

for h2 in h2_list:
    result.append({
        "name": h2.text.strip(),
        "children": getChildren(h2)
    })

print(json.dumps(result, indent=4, sort_keys=True))
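If each h2 and the content under it are direct siblings inside the content div (an assumption about the page layout, not something checked against the live advisory), the same grouping can be sketched with find_next_siblings, which only visits tags on the same level. The SAMPLE_HTML string and the group_by_h2 helper below are made up for illustration, so the snippet runs without a network request:

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the advisory page (assumption:
# every h2 and the content under it are direct siblings in the same div).
SAMPLE_HTML = """
<div class="content">
  <h2>Risk Evaluation</h2>
  <p>Successful exploitation could allow access.</p>
  <h2>Mitigations</h2>
  <p>Apply the vendor update.</p>
  <ul><li>Restrict network access.</li><li>Use a VPN.</li></ul>
  <h4>Contact</h4>
  <p>Report incidents to the vendor.</p>
</div>
"""


def group_by_h2(html):
    """Map each h2 heading to the text of the sibling tags that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    content_div = soup.find("div", {"class": "content"})
    sections = {}
    for h2 in content_div.find_all("h2"):
        parts = []
        # find_next_siblings() returns only tags on the same level as the h2,
        # so nested tags such as li are not visited a second time.
        for sibling in h2.find_next_siblings():
            if sibling.name == "h2":
                break  # the next section starts here
            parts.append(sibling.get_text(" ", strip=True))
        sections[h2.get_text(strip=True)] = "\n".join(parts)
    return sections


sections = group_by_h2(SAMPLE_HTML)
print(sections["Mitigations"])
```

Each heading ends up with one string, which could then go into a single text column. If the real page nests its sections inside extra wrapper tags, the sibling walk would need to start from the wrapper instead.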
Upvotes: 1