user3260279

Reputation: 1

findAll() fails to find any elements within a given parent element

I'm trying to pull gene sequences from the NCBI website using Python and BeautifulSoup.

Viewing the HTML of the sequence page, I noticed that the sequence is stored in span elements nested inside a pre element, which is itself nested inside a div element.

I've used findAll() to try to pull the strings contained inside the span elements, but it returns an empty list. I've also tried findAll() on the parent div element. It does return the div in question, but without any of the HTML nested inside it. The returned div also looks somewhat "corrupted": some of the attributes in the opening div tag are either missing or in a different order from the one shown on the webpage.

The following sample code is representative of the scenario:

Actual HTML:

<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=*some_sequencesize* virtualsequence style="display: block;">
    <pre>
        "*some gene information enclosed inside double quotation marks*
        "
        <span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 1*</span>
        <span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 2*</span>
        ...
        <span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE N*</span>
    </pre>
</div>

My Code Snippets:

The aim of my code is to pull the string contents of the pre element (both the span strings and the opening string beginning "*some gene information...").

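# Python 2, as in the rest of this snippet; the bs4 import assumes BeautifulSoup 4.
import urllib2
from bs4 import BeautifulSoup

# Assume some predefined gene sequence URL, gene_url.
page = urllib2.urlopen(gene_url)
soup = BeautifulSoup(page.read())
spans = soup.findAll('span', {'class': 'ff_line'})
for span in spans:
    print span.string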

This prints nothing because the spans list is empty. The same thing happens if findAll() is applied to pre instead of span.

When I try to find the parent div element using the same procedure as above:

# ...
divs = soup.findAll('div',{'class':'seq gbff'})
for div in divs:
    print div

I get the following print output:

<div class="seq gbff" id="viewercontent1" sequencesize="*some_sequencesize*" val="*some_val*" virtualsequence=""></div>

The most obvious difference is that the printed result doesn't contain any of the nested HTML, but the content of the opening div tag is also different (attributes are either missing or in a different order). Compare with the equivalent line on the webpage:

<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=*some_sequencesize* virtualsequence style="display: block;">

Does this issue have something to do with the virtualsequence attribute in the opening div tag?

How can I achieve my desired aim?

Upvotes: 0

Views: 219

Answers (1)

Pawel Miech

Reputation: 7822

`class` is a reserved keyword in Python (it is used to define classes), so maybe this is causing the trouble. You can try appending an underscore and passing it as a keyword argument; perhaps this will help:

>>> soup.find_all('span',class_='ff_line')

Check out the docs.
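As a minimal sketch of the suggestion above, the snippet below runs both the keyword form (`class_`) and the dictionary form from the question against a small inline document. The sample markup, ids, and sequence strings are made up for illustration and only loosely mirror the structure described in the question; it assumes BeautifulSoup 4 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the sequence page; the ids and text are invented.
SAMPLE_HTML = """
<div id="viewercontent1" class="seq gbff">
  <pre>some gene information
<span class="ff_line" id="line_1">ACGTACGT</span>
<span class="ff_line" id="line_2">TTGACCAA</span>
  </pre>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword form: class_ avoids the clash with Python's reserved word `class`.
by_keyword = soup.find_all("span", class_="ff_line")

# Dictionary form, as used in the question.
by_dict = soup.find_all("span", {"class": "ff_line"})

# Collect the text of each matching span.
lines = [span.get_text() for span in by_keyword]
print(lines)
```

If both forms find the spans in sample markup like this but nothing on the live page, the fetched HTML may simply not contain them, which would be consistent with the empty div printed in the question.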

Upvotes: 2
