Python: Parsing SGML

Question

I'm trying to parse some SGML like the following in Python:



    One
    Sample One


    Two
    Sample Two

Here, I'm just looking for everything inside the tags (i.e. ["Sample One", "Sample Two"]).

I've tried using BeautifulSoup, but it doesn't like the first line and also expects everything to be wrapped in a single root tag. While I can manually make these changes before passing the text to BeautifulSoup, that feels a bit too hacky.

I'm pretty new to SGML, and also not married to BeautifulSoup, so I'm open to any suggestions.

(For those curious: my specific use case is the Reuters-21578 dataset.)
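
One direction that might avoid both pre-processing steps is the standard library's `html.parser.HTMLParser`, which is lenient about a leading declaration and does not require a single root element. The following is only a minimal sketch: it assumes the wanted text sits in elements named `BODY` (as in the Reuters-21578 `.sgm` files), the sample string is an assumed stand-in for the real markup, and `BodyTextExtractor` and `extract_bodies` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect the text found inside every element with the given tag name."""

    def __init__(self, tag="body"):
        super().__init__(convert_charrefs=True)
        self.tag = tag.lower()  # HTMLParser reports tag names in lowercase
        self.bodies = []
        self._buffer = None     # None means "not currently inside the tag"

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.bodies.append("".join(self._buffer).strip())
            self._buffer = None


def extract_bodies(sgml_text, tag="body"):
    """Return the stripped text of each <tag>...</tag> element, in order."""
    parser = BodyTextExtractor(tag)
    parser.feed(sgml_text)
    parser.close()
    return parser.bodies


# Assumed sample, shaped like Reuters-21578 markup: a declaration on the
# first line and several top-level records with no common root element.
sample = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
    <TITLE>One</TITLE>
    <BODY>Sample One</BODY>
</TEXT>
<TEXT>
    <TITLE>Two</TITLE>
    <BODY>Sample Two</BODY>
</TEXT>
"""

print(extract_bodies(sample))
```

This treats the input as loosely structured tag soup rather than validating it against a DTD, so it is probably only suitable when the goal is to pull out text from known tags.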
