Reputation: 31
I am processing an XML feed with BeautifulSoup, but for some reason it is skipping part of the param tag. I already tried changing the parser (html.parser / html5lib / lxml), but all of them give the same output.
Can someone help with this?
Original XML file:
<SHOPITEM>
<PRODUCTNO>DK28-SLV</PRODUCTNO>
<PARAM>
<PARAM_NAME>Způsob komunikace</PARAM_NAME>
<VAL>WiFi pro internetové připojení</VAL>
</PARAM>
</SHOPITEM>
Output from BeautifulSoup:
<shopitem>
<productno>DK28-SLV</productno>
<param_name>Způsob komunikace</param_name>
<val>WiFi pro internetové připojení</val>
<param/>
</shopitem>
Desired output:
<shopitem>
<productno>DK28-SLV</productno>
<param> -------> This one is missing
<param_name>Způsob komunikace</param_name>
<val>WiFi pro internetové připojení</val>
</param>
</shopitem>
My code:
from bs4 import BeautifulSoup
import requests

source = requests.get("my-xml-feed-url").text
soup = BeautifulSoup(source, "lxml")
product = soup.find("shopitem")
for product in soup.find_all("shopitem"):
    productno = product.find("productno")
    print(productno)
    param = product.find("param")
    print(param)
    param_name = product.find("param_name")
    print(param_name)
    param_val = product.find("val")
    print(param_val)
UPDATE: I tried changing the parser to "xml".
It partly helped, and the tag is now shown correctly. But the output is now corrupted in a different place. Roughly the second half of the XML looks OK, but the first half is not shown at all.
Original XML:
<PARAM>
<PARAM_NAME>Funkce alarmu</PARAM_NAME>
<VAL>Ano, do mobilní aplikace</VAL>
</PARAM>
Beginning of the output:
/PARAM_NAME>
<VAL>Ano, do mobilní aplikace</VAL>
</PARAM>
This is where the output starts, so for some reason everything in the XML before this point is cut off. As far as I can tell, the XML structure is no different before and after this point, so I see no reason for this.
Further output is OK:
<PARAM>
<PARAM_NAME>
Úhel záběru
</PARAM_NAME>
<VAL>
60°
</VAL>
</PARAM>
Upvotes: 0
Views: 80
Reputation: 4707
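By default, BeautifulSoup assumes that you are parsing HTML, and that is what corrupts your XML: the HTML parsers lowercase tag names and treat param as a void (self-closing) HTML element, so its children end up outside it. You should use the "xml" parser instead, like this: BeautifulSoup(source, "xml"). From the documentation: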
By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in “xml” as the second argument to the BeautifulSoup constructor:
soup = BeautifulSoup(markup, "xml")
from bs4 import BeautifulSoup

source = """
<SHOPITEM>
<PRODUCTNO>DK28-SLV</PRODUCTNO>
<PARAM>
<PARAM_NAME>Způsob komunikace</PARAM_NAME>
<VAL>WiFi pro internetové připojení</VAL>
</PARAM>
</SHOPITEM>
"""

soup = BeautifulSoup(source, "xml")
product = soup.find("SHOPITEM")
for product in soup.find_all("SHOPITEM"):
    productno = product.find("PRODUCTNO")
    print(productno)
    param = product.find("PARAM")
    print(param)
    param_name = product.find("PARAM_NAME")
    print(param_name)
    param_val = product.find("VAL")
    print(param_val)
Output
<PRODUCTNO>DK28-SLV</PRODUCTNO>
<PARAM>
<PARAM_NAME>Způsob komunikace</PARAM_NAME>
<VAL>WiFi pro internetové připojení</VAL>
</PARAM>
<PARAM_NAME>Způsob komunikace</PARAM_NAME>
<VAL>WiFi pro internetové připojení</VAL>
Note: XML is case-sensitive, so you need to write the tag names in uppercase, exactly as they appear in the feed.
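To illustrate the case-sensitivity point, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of BeautifulSoup (a swapped-in parser, chosen only so the example runs without extra packages). The sample data is the snippet from the question; the variable names upper and lower are just for illustration.

```python
import xml.etree.ElementTree as ET

source = """
<SHOPITEM>
<PRODUCTNO>DK28-SLV</PRODUCTNO>
<PARAM>
<PARAM_NAME>Způsob komunikace</PARAM_NAME>
<VAL>WiFi pro internetové připojení</VAL>
</PARAM>
</SHOPITEM>
"""

# The root element returned here is <SHOPITEM> itself.
root = ET.fromstring(source)

# Tag lookups in XML are case-sensitive: only the exact name matches.
upper = root.find("PARAM")
lower = root.find("param")
print(upper is not None)  # the uppercase name is found
print(lower is None)      # the lowercase name is not

# The children stay nested inside <PARAM>, as in the original feed.
param_name = upper.findtext("PARAM_NAME")
param_val = upper.findtext("VAL")
print(param_name, "=", param_val)
```

The same rule applies to BeautifulSoup with the "xml" parser: product.find("param") returns None for this feed, while product.find("PARAM") returns the element.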
Upvotes: 2