Extracting a custom XML tag

Question

The following is the content of an item element from an XML file. How can I extract the media:content tag using BeautifulSoup?


            How Kerala is preparing for monsoon amid the COVID-19 pandemic
            https://www.thenewsminute.com/article/how-kerala-preparing-monsoon-amid-covid-19-pandemic-125007
                  Usually, Kerala begins its procedure for monsoon preparedness by January. This year, however, the officials got busy with preparing for a health crisis instead. “Kerala works six months and fights the monsoon in the other six months,” says Sekhar Kuriakose, member secretary of the Kerala State Disaster Management Authority (KSDMA). Usually, Kerala begins its monsoon preparedness by January, even before the India Meteorological Department (IMD) makes its first long-range forecast for southwe...
            Thu, 21 May 2020 10:30:00 GMT
            https://www.thenewsminute.com/article/how-kerala-preparing-monsoon-amid-covid-19-pandemic-125007

Bernardo Sulzbach · Accepted Answer

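Your issue may be how BS4 handles namespaces with the parser backend you are using. Specifying "lxml" instead of "xml" as the parser allows you to use find() and find_all() as you might expect in this case. The "lxml" backend parses the document as HTML, so media:content is treated as an ordinary tag name rather than as a namespace prefix plus a local name.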

Letting t be a string with the XML you provided,

from bs4 import BeautifulSoup

soup = BeautifulSoup(t, "xml")
print(soup.find_all("media:content"))

produces

[]

However, with the "lxml" parser, find_all() is able to find the element:

soup = BeautifulSoup(t, "lxml")
print(soup.find_all("media:content"))

produces a list containing the media:content element instead of an empty list.
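If it helps to see the namespace mechanics spelled out, here is a minimal sketch that swaps BeautifulSoup for the standard library's xml.etree.ElementTree. It is an alternative technique, not the approach above. It assumes the feed declares the media prefix with the Media RSS namespace URI (http://search.yahoo.com/mrss/) and that media:content carries a url attribute; the sample markup, the example.com URL, and the helper name media_content_urls are all hypothetical, since the original markup was not preserved.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the element layout and the example.com URL
# are illustrative assumptions, not the asker's actual data.
SAMPLE = """<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>How Kerala is preparing for monsoon amid the COVID-19 pandemic</title>
      <media:content url="https://example.com/image.jpg" medium="image"/>
    </item>
  </channel>
</rss>"""

# Map the prefix used in the query to the namespace URI declared in the feed.
NAMESPACES = {"media": "http://search.yahoo.com/mrss/"}


def media_content_urls(xml_text):
    """Return the url attribute of every media:content element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [el.get("url") for el in root.iterfind(".//media:content", NAMESPACES)]


urls = media_content_urls(SAMPLE)
print(urls)
```

The point of the sketch is that an XML-aware parser matches on the namespace URI, not on the literal text "media:content", which is why a namespace-aware lookup needs the prefix mapping while an HTML-style parse does not.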
