Reputation: 6554
I'm trying to parse HTML content from a site with BS4. I have my HTML fragment, but I need to remove the tags' attributes: classes, IDs, styles, etc.
For example:
<div class="applinks">
<div class="appbuttons">
<a href="https://geo.itunes.apple.com/ru/app/cloud-hub-file-manager-document/id972238010?mt=8&at=11l3Ss" rel="nofollow" target="_blank" title="Cloud Hub - File Manager, Document Reader, Clouds Browser and Download Manager">Загрузить</a>
<span onmouseout="jQuery('.wpappbox-8429dd98d1602dec9a9fc989204dbf7c .qrcode').hide();" onmouseover="jQuery('.wpappbox-8429dd98d1602dec9a9fc989204dbf7c .qrcode').show();">QR-Code</span>
</div>
</div>
I need to get:
<div>
<div>
<a href="https://geo.itunes.apple.com/ru/app/cloud-hub-file-manager-document/id972238010?mt=8&at=11l3Ss" rel="nofollow" target="_blank" title="Cloud Hub - File Manager, Document Reader, Clouds Browser and Download Manager">Загрузить</a>
<span>QR-Code</span>
</div>
</div>
My code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
url = "https://lifehacker.ru/2016/08/29/app-store-29-august-2016/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
# attrs must be a dict ({"class": ...}), not a set ({"class", ...})
post_content = soup.find("div", {"class": "post-content"})
print(post_content)
How can I remove all the tag attributes?
Upvotes: 0
Views: 3870
Reputation: 784
To remove all attributes from the tags in the scraped data:
import requests
from bs4 import BeautifulSoup
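def CleanSoup(content):
    # findAll(True) matches every tag nested inside content
    # (content's own attributes are left untouched)
    for tags in content.findAll(True):
        tags.attrs = {}  # replace the attribute dict with an empty one
    return content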
url = "https://lifehacker.ru/2016/08/29/app-store-29-august-2016/"
r = requests.get(url)
soup = BeautifulSoup(r.content,"html.parser")
post_content = soup.find("div", {"class": "post-content"})
post_content = CleanSoup(post_content)
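Note that setting `tags.attrs = {}` also drops `href`, `rel`, `target` and `title` from the link, which the expected output in the question keeps. A minimal sketch of a whitelist variant follows. The names `KEEP_ATTRS` and `strip_attrs` are hypothetical, and the HTML string is a shortened stand-in for the real page so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist: attribute names to keep; everything else is dropped.
KEEP_ATTRS = {"href", "rel", "target", "title"}


def strip_attrs(root, keep=KEEP_ATTRS):
    """Drop every attribute not listed in `keep` from root and its descendants."""
    for tag in [root] + root.find_all(True):
        tag.attrs = {name: value for name, value in tag.attrs.items() if name in keep}
    return root


# Shortened stand-in for the fragment in the question (assumed, not the real page).
html = (
    '<div class="applinks"><div class="appbuttons">'
    '<a href="https://example.com/app" rel="nofollow" target="_blank" title="App">Download</a>'
    '<span onmouseover="show()" onmouseout="hide()">QR-Code</span>'
    '</div></div>'
)

soup = BeautifulSoup(html, "html.parser")
cleaned = strip_attrs(soup.find("div"))
print(cleaned)
```

Unlike `CleanSoup`, this sketch also cleans the root tag itself, so the outer `div` loses its `class` as well.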
Upvotes: 1
Reputation: 1761
import requests
from bs4 import BeautifulSoup
url = "https://lifehacker.ru/2016/08/29/app-store-29-august-2016/"
r = requests.get(url)
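soup = BeautifulSoup(r.content, "html.parser")
# Calling soup() is shorthand for soup.find_all(), i.e. every tag in the document
for tag in soup():
    for attribute in ["class"]:  # You can also add id, style, etc. to the list
        del tag[attribute]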
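The span in the question carries `onmouseover`/`onmouseout` handlers, which a fixed list of names would miss unless each one is listed. A possible sketch, assuming you want to drop a blacklist plus anything whose name starts with `on`; `DROP_ATTRS` and `drop_attrs` are hypothetical names and the HTML is a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical blacklist of attribute names to remove.
DROP_ATTRS = {"class", "id", "style"}


def drop_attrs(soup, names=DROP_ATTRS, drop_event_handlers=True):
    """Remove blacklisted attributes (and optionally on* handlers) from every tag."""
    for tag in soup.find_all(True):
        # Iterate over a copy of the keys because the dict is mutated in the loop.
        for name in list(tag.attrs):
            if name in names or (drop_event_handlers and name.startswith("on")):
                del tag.attrs[name]
    return soup


# Made-up fragment for illustration only.
html = '<div id="a" class="b" style="c"><span onmouseover="x()" data-k="v">t</span></div>'
soup = drop_attrs(BeautifulSoup(html, "html.parser"))
print(soup)
```

Attributes that are neither blacklisted nor event handlers (here `data-k`) are left in place.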
Upvotes: 4