Dyna

Reputation: 71

Help parsing between <pre> tags using BeautifulSoup

I am attempting to parse information out of a website using BeautifulSoup and Python. The HTML looks like the sample further below. I want my parsed data to look like:

ID Definition
Lysine biosynthesis - Burkholderia pseudomallei 17
... with the rest of the data taken from the same place (inside the "pre" tags and outside the "a" tags).

How can I do this?

<pre>ID                   Definition
    ----------------------------------------------------------------------------------------------------
<a href="/kegg-bin/show_pathway?bpm00300">bpm00300</a>             Lysine biosynthesis - Burkholderia pseudomallei 17 
<a href="/kegg-bin/show_pathway?bpm00330">bpm00330</a>             Arginine and proline metabolism - Burkholderia pse 
<a href="/kegg-bin/show_pathway?bpm01100">bpm01100</a>             Metabolic pathways - Burkholderia pseudomallei 171 
<a href="/kegg-bin/show_pathway?bpm01110">bpm01110</a>             Biosynthesis of secondary metabolites - Burkholder 
</pre>

I have tried:

y = soup.find('pre')  # returns data between <pre> tags. Specific to KEGG
for a in y:
    z = a.string

This gave me:

 ID                   Definition
----------------------------------------------------------------------------------------------------

Thanks for the help!

Upvotes: 2

Views: 2643

Answers (1)

smci

Reputation: 33940

BeautifulSoup() and its search methods return a hierarchical parse-tree object, not just a string. Iterating through findChildren() on the <pre> node visits only its child tags (here the <a> elements), which skips the header line; each a.string is then the ID held in that link:

for a in soup.find('pre').findChildren():
    z = a.string
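
If the definition column is needed as well, it sits in the text node that follows each <a> tag rather than inside it. Below is a minimal sketch of one way to pair the two, assuming the layout shown in the question; the sample markup is cut down from the question, and the names html, rows, pathway_id and definition are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup taken from the question (definitions are truncated in the source page).
html = """<pre>ID                   Definition
    ----------------------------------------------------------------------------------------------------
<a href="/kegg-bin/show_pathway?bpm00300">bpm00300</a>             Lysine biosynthesis - Burkholderia pseudomallei 17 
<a href="/kegg-bin/show_pathway?bpm00330">bpm00330</a>             Arginine and proline metabolism - Burkholderia pse 
</pre>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for a in soup.find("pre").find_all("a"):
    # The link text is the ID.
    pathway_id = a.get_text(strip=True)
    # The definition is the plain text node right after the <a> tag, if there is one.
    sibling = a.next_sibling
    definition = sibling.strip() if isinstance(sibling, NavigableString) else ""
    rows.append((pathway_id, definition))

for pathway_id, definition in rows:
    print(pathway_id, definition)
```

This relies on each definition immediately following its link as bare text; if the page wraps the definitions in other tags, the isinstance check leaves them empty instead of guessing.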

Upvotes: 1
