Abhilash Kr

Reputation: 65

Extracting a custom XML tag

The following is the content of an item tag from an XML file. How can I extract the media:content tag using BeautifulSoup?

<item>
            <title>How Kerala is preparing for monsoon amid the COVID-19 pandemic</title>
            <link/>https://www.thenewsminute.com/article/how-kerala-preparing-monsoon-amid-covid-19-pandemic-125007
                  <description>Usually, Kerala begins its procedure for monsoon preparedness by January. This year, however, the officials got busy with preparing for a health crisis instead. “Kerala works six months and fights the monsoon in the other six months,” says Sekhar Kuriakose, member secretary of the Kerala State Disaster Management Authority (KSDMA). Usually, Kerala begins its monsoon preparedness by January, even before the India Meteorological Department (IMD) makes its first long-range forecast for southwe...</description>
            <pubdate>Thu, 21 May 2020 10:30:00 GMT</pubdate>
            <guid>https://www.thenewsminute.com/article/how-kerala-preparing-monsoon-amid-covid-19-pandemic-125007</guid>
            <media:content medium="image" url="https://www.thenewsminute.com/sites/default/files/Kerala-rain-trivandrum-1200.jpg" width="600"></media:content>
</item>

Upvotes: 1

Views: 217

Answers (1)

Bernardo Sulzbach

Reputation: 1591

Your issue may be how BS4 handles namespaces with the parser backend you are using. Passing "lxml" instead of "xml" as the parser argument allows you to use find() and find_all() with the prefixed tag name as you might expect in this case.

Letting t be a string with the XML you provided,

from bs4 import BeautifulSoup

# t holds the <item> XML from the question
soup = BeautifulSoup(t, "xml")
print(soup.find_all("media:content"))

produces

[]

However, with the lxml parser, find_all() is able to find the element:

soup = BeautifulSoup(t, "lxml")
print(soup.find_all("media:content"))

produces

[<media:content medium="image" (...)></media:content>]
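Once the tag is found, you will probably want its attributes rather than the tag itself. Here is a minimal sketch of that step, assuming bs4 and lxml are installed; the string t below is a shortened copy of the item from the question, and the variable names media and image_url are just illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <item> from the question (only the relevant tags kept).
t = """
<item>
    <title>How Kerala is preparing for monsoon amid the COVID-19 pandemic</title>
    <media:content medium="image" url="https://www.thenewsminute.com/sites/default/files/Kerala-rain-trivandrum-1200.jpg" width="600"></media:content>
</item>
"""

soup = BeautifulSoup(t, "lxml")

# find() returns the first matching Tag, or None when nothing matches.
media = soup.find("media:content")

# Attributes can be read like dictionary entries; .get() returns None
# instead of raising KeyError when an attribute is missing.
image_url = media.get("url") if media is not None else None
print(image_url)
```

Checking for None before reading attributes keeps the code from failing on items that have no media:content tag.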

Upvotes: 2
