findAll() fails to find any elements within a given parent element

Question

I'm trying to pull gene sequences from the NCBI website using Python and BeautifulSoup.

Viewing the HTML of the sequence page, I noticed that the sequence is stored in span elements nested inside a pre element, which is itself nested inside a div element.

I used findAll() to try to pull the string contained inside each span element, but it returns an empty list. I also tried findAll() on the parent div element. It does return the div in question, but without any of the HTML nested inside it. The returned div also looks somewhat "corrupted": some of the attributes in the opening div tag are either missing or in a different order from the one shown on the webpage.

The following sample code is representative of the scenario:

Actual HTML:


            "*some gene information enclosed inside double quotation marks*
        "
        *GENETIC SEQUENCE LINE 1*
        *GENETIC SEQUENCE LINE 2*
        ...
        *GENETIC SEQUENCE LINE N*

My Code Snippets:

The aim of my code is to pull the string contents of the pre element (both the span strings and the opening string beginning "*some gene information...").

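# Python 2. Assume some predefined gene sequence URL, gene_url,
# and that BeautifulSoup has already been imported.
import urllib2

page = urllib2.urlopen(gene_url)
soup = BeautifulSoup(page.read())
# Look for every span whose class is "ff_line".
spans = soup.findAll('span', {'class': 'ff_line'})
for span in spans:
    print span.string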

This prints nothing because the spans list is empty. Much the same problem occurs if findAll() is applied to pre instead of span.
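One thing that may be worth ruling out is that the span elements are simply not present in the HTML the server returns, for example because the browser fills them in afterwards. The sketch below is a parser-independent check, not a fix: it counts matching start tags in a raw HTML string using only the standard library. It is written for Python 3 (unlike the Python 2 snippets in this question), and `count_tags`, `static_html`, and `empty_html` are hypothetical names and sample strings standing in for the downloaded page.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags that match a tag name and carry a given class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_tags(html, tag, css_class):
    """Return how many <tag class="... css_class ..."> start tags html contains."""
    counter = TagCounter(tag, css_class)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical stand-ins for the text of the downloaded page.
static_html = '<div class="seq gbff"><pre><span class="ff_line">ATGC</span></pre></div>'
empty_html = '<div class="seq gbff"></div>'

print(count_tags(static_html, "span", "ff_line"))  # 1
print(count_tags(empty_html, "span", "ff_line"))   # 0
```

If the count for the real page text is 0, the spans are missing from the response itself, so no choice of findAll() arguments could find them.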

When I try to find the parent div element using the same procedure as above:

# ...
divs = soup.findAll('div',{'class':'seq gbff'})
for div in divs:
    print div

I get the following print output:

The most obvious difference is that the printed result doesn't contain any of the nested HTML, but the content of the opening div tag is also different (attributes are either missing or in a different order). Compare with equivalent line on webpage:

Does this issue have something to do with the virtualsequence attribute in the opening div tag?
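To compare the div as served with the div shown in the browser, it may help to list the attributes of the opening tag in the order they appear in the raw response. The following is a minimal Python 3 sketch using only the standard library; `served_div_attributes` and the `sample` string (including its `id` value and the empty `virtualsequence` value) are hypothetical and are not taken from the actual NCBI page.

```python
from html.parser import HTMLParser


class DivAttributeCollector(HTMLParser):
    """Collect the attribute lists of div start tags with an exact class value."""

    def __init__(self, css_class_value):
        super().__init__()
        self.css_class_value = css_class_value
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs keeps the attributes in the order they appear in the source.
        if tag == "div" and dict(attrs).get("class") == self.css_class_value:
            self.found.append(attrs)


def served_div_attributes(html, css_class_value="seq gbff"):
    """Return one list of (name, value) pairs per matching div in html."""
    collector = DivAttributeCollector(css_class_value)
    collector.feed(html)
    collector.close()
    return collector.found


# Hypothetical sample; the real attribute names and values may differ.
sample = '<div id="example" virtualsequence="" class="seq gbff"></div>'
for attrs in served_div_attributes(sample):
    print([name for name, _ in attrs])
```

Running this on the real page text and comparing it with the browser's view would show whether the differences come from the served HTML or from how the parsed tag is printed.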

How can I achieve my desired aim?
