user3306642

Reputation: 31

BeautifulSoup missing part of tag

I am processing an XML feed with BeautifulSoup, but for some reason it is skipping part of the param tag. I already tried changing the parser (html.parser / html5lib / lxml), but all of them give the same output.

Can someone help with this?

Original XML file:

<SHOPITEM>
 <PRODUCTNO>DK28-SLV</PRODUCTNO>
 <PARAM>
  <PARAM_NAME>Způsob komunikace</PARAM_NAME>
  <VAL>WiFi pro internetové připojení</VAL>
 </PARAM>
</SHOPITEM>

Output from BeautifulSoup:

<shopitem>
 <productno>DK28-SLV</productno> 
 <param_name>Způsob komunikace</param_name>
 <val>WiFi pro internetové připojení</val>
 <param/>
</shopitem>

Desired output:

<shopitem>
 <productno>DK28-SLV</productno> 
 <param>      -------> This one is missing
  <param_name>Způsob komunikace</param_name>
  <val>WiFi pro internetové připojení</val>
 </param>
</shopitem>

My code:

from bs4 import BeautifulSoup
import requests

source = requests.get("my-xml-feed-url").text
soup = BeautifulSoup(source, "lxml")

product = soup.find("shopitem")


for product in soup.find_all("shopitem"):
    productno = product.find("productno")
    print(productno)
    param = product.find("param")
    print(param)
    param_name = product.find("param_name")
    print(param_name)
    param_val = product.find("val")
    print(param_val)

UPDATE: after switching the parser to "xml".

It partly helped, and the tag is now shown correctly. But the output is now broken in a different place: roughly the second half of the XML is fine, but the first half is not shown at all.

Original XML:

<PARAM>
<PARAM_NAME>Funkce alarmu</PARAM_NAME>
<VAL>Ano, do mobilní aplikace</VAL>
</PARAM>

Output beginning:

/PARAM_NAME> 
<VAL>Ano, do mobilní aplikace</VAL> 
</PARAM>

This is where the output starts, so for some reason everything before this point is cut off. The XML structure looks the same before and after this point, so I see no reason for this.

Further output is OK:

<PARAM>
   <PARAM_NAME>
    Úhel záběru
   </PARAM_NAME>
   <VAL>
    60°
   </VAL>
  </PARAM>

Upvotes: 0

Views: 80

Answers (1)

Yevhen Bondar

Reputation: 4707

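By default, BeautifulSoup assumes that you are parsing HTML, and that corrupts your XML: in HTML, <param> is a void element (it cannot have children), so the HTML parsers close it immediately and its children end up outside it. You should use the "xml" parser instead, like this: BeautifulSoup(source, "xml")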

From the docs:

By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in “xml” as the second argument to the BeautifulSoup constructor:

soup = BeautifulSoup(markup, "xml")
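To see the HTML behaviour in isolation, here is a minimal sketch that parses the sample from the question with "html.parser" on purpose. The variable names (html_soup, nested_name, sibling_name) are only illustrative, and the exact tree can differ slightly between the HTML parsers; the point is only that <param> ends up without children.

```python
from bs4 import BeautifulSoup

source = """
<SHOPITEM>
 <PRODUCTNO>DK28-SLV</PRODUCTNO>
 <PARAM>
  <PARAM_NAME>Způsob komunikace</PARAM_NAME>
  <VAL>WiFi pro internetové připojení</VAL>
 </PARAM>
</SHOPITEM>
"""

# Parse as HTML on purpose; the HTML parser lowercases tag names.
html_soup = BeautifulSoup(source, "html.parser")

param = html_soup.find("param")
# <param> is a void element in HTML, so it is closed right away and stays empty.
nested_name = param.find("param_name")
# The would-be children are still in the tree, just not inside <param>.
sibling_name = html_soup.find("shopitem").find("param_name")

print(nested_name)   # None
print(sibling_name)
```

This matches the output in the question: param_name and val are found, but as siblings of an empty param rather than as its children.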


from bs4 import BeautifulSoup

source = """
<SHOPITEM>
 <PRODUCTNO>DK28-SLV</PRODUCTNO>
 <PARAM>
  <PARAM_NAME>Způsob komunikace</PARAM_NAME>
  <VAL>WiFi pro internetové připojení</VAL>
 </PARAM>
</SHOPITEM>
"""
# The "xml" parser requires lxml and keeps tag names exactly as written.
soup = BeautifulSoup(source, "xml")

for product in soup.find_all("SHOPITEM"):
    productno = product.find("PRODUCTNO")
    print(productno)
    param = product.find("PARAM")
    print(param)
    param_name = product.find("PARAM_NAME")
    print(param_name)
    param_val = product.find("VAL")
    print(param_val)

Output

<PRODUCTNO>DK28-SLV</PRODUCTNO>
<PARAM>
<PARAM_NAME>Způsob komunikace</PARAM_NAME>
<VAL>WiFi pro internetové připojení</VAL>
</PARAM>
<PARAM_NAME>Způsob komunikace</PARAM_NAME>
<VAL>WiFi pro internetové připojení</VAL>

Note: XML is case-sensitive, so you need to write the tag names in uppercase, exactly as they appear in the feed.
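A small sketch of the case sensitivity, assuming lxml is installed (the "xml" parser needs it). The one-line source string is a shortened stand-in built from values in the question, not the real feed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample; the real feed has more elements.
source = "<SHOPITEM><PARAM><VAL>60°</VAL></PARAM></SHOPITEM>"
soup = BeautifulSoup(source, "xml")

lower = soup.find("param")   # lowercase name does not match <PARAM>
upper = soup.find("PARAM")   # exact-case name matches

print(lower)                          # None
print(upper.find("VAL").get_text())
```

So code written against the HTML parsers, which lowercase every tag name, has to be updated to the feed's original casing once you switch to "xml".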

Upvotes: 2
