Slowat_Kela

Reputation: 1511

Full HTML is not being parsed with BeautifulSoup - is this because of dynamic HTML?

I'm trying to scrape the table on this page.

In the browser's developer tools I can see that the table I want is present in the HTML, for example the row labelled Peptide Name.

I wrote this code to extract this table:

import requests
from bs4 import BeautifulSoup

for i in range(1001, 1003):
    res = requests.get("https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=" + str(i))
    soup = BeautifulSoup(res.content, 'html.parser')
    table = soup.find_all('table')  # every <table> the parser recognizes
    print(table)

But the output that is printed is:

[<table bgcolor="#DAD5BF" border="1" cellpadding="5" width="970"><tr><td align="center">\n\t      This page displays user query in tabular form.\n</td></tr>\n</table>, <table width="970px"><tr><td align="center"><br/><font color="black" size="5px">1001  details</font><br/></td></tr></table>]

Can someone explain why find_all is not finding all of the tables (specifically the one I want), and how I can fix this?

Upvotes: 1

Views: 68

Answers (2)

Andersson

Reputation: 52665

FYI, in case you want to know the root cause of your issue: the target table has invalid markup:

<table class ="tab" cellpadding= "5" ... STYLE="border-spacing: 0px;border-style: line ;
 <tr bgcolor="#DAD5BF"></tr>

Note that the start tag is never closed: <table ... should be <table ...>. In addition, the table's ancestor is opened with <div> but closed with </p>.

That's why BeautifulSoup doesn't recognize this as a table, so it is not returned by soup.find_all('table').

However, modern browsers have built-in error recovery that "fixes" broken tags, so the table doesn't look broken in the browser: a closing </div> is added to the ancestor div, and the stray </p> tag is turned into an empty <p></p> node.
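
To tell apart the two possibilities raised in this thread (the table never arrives in the response, versus it arrives but the parser fails to recognize it), you can compare the raw text with what a parser sees. Below is a minimal sketch using only the standard library's html.parser, which is the parser behind BeautifulSoup's 'html.parser' option. The names TableCounter and diagnose and the sample markup are made up for illustration, and the real page may behave differently.

```python
import re
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts the <table> start tags that the parser actually recognizes."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase by HTMLParser.
        if tag == "table":
            self.tables += 1


def diagnose(html, marker):
    """Report whether a marker string and the tables are visible (hypothetical helper)."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return {
        "marker_in_raw_html": marker in html,
        "raw_table_openings": len(re.findall(r"<table\b", html, re.IGNORECASE)),
        "parsed_table_tags": counter.tables,
    }


# Made-up, well-formed sample; replace it with res.text from the question's code.
sample = (
    '<table width="970"><tr><td>intro</td></tr></table>'
    '<table class="tab"><tr><td>Peptide Name</td><td>Polydim-I</td></tr></table>'
)
report = diagnose(sample, "Peptide Name")
print(report)
```

If marker_in_raw_html is False, the content is not in the downloaded HTML at all (for example because it is rendered by JavaScript or served from a different URL). If raw_table_openings is larger than parsed_table_tags, the markup is probably malformed, and passing a different parser such as 'lxml' or 'html5lib' to BeautifulSoup may give a different result.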

Upvotes: 0

chitown88

Reputation: 28564

Not sure why it's not showing.

Since the data is in a table anyway, I just went ahead and used pandas' read_html:

import pandas as pd

url = 'https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=antitb_1001'

tables = pd.read_html(url)
table = tables[-1]
print(table)

Output:
                           0                                                  1
0        Primary information                                                NaN
1                         ID                                        antitb_1001
2               Peptide Name                                          Polydim-I
3                   Sequence                             AVAGEKLWLLPHLLKMLLTPTP
4    N-terminal Modification                                               Free
5    C-terminal Modification                                               Free
6      Chemical Modification                                               None
7             Linear/ Cyclic                                             Linear
8                     Length                                                 22
9                  Chirality                                                  L
10                    Nature                                        Amphipathic
11                    Source                                            Natural
12                    Origin  Isolated from the venom of the Neotropical was...
13                   Species         Mycobacterium abscessus subsp. massiliense
14                    Strain  Mycobacterium abscessus subsp. massiliense iso...
15  Inhibition Concentartion                                  MIC = 60.8 μg/mL
16          In vitro/In vivo                                               Both
17                 Cell Line  Peritoneal macrophages, J774 macrophages cells...
18  Inhibition Concentartion  Treatment of infected macrophages with 7.6 μg...
19              Cytotoxicity  Non-cytotoxic, 10% cytotoxicity on J774 cells ...
20             In vivo Model  6 to 8 weeks old BALB/c and IFN-γKO (Knockout...
21               Lethal Dose  2 mg/kg/mLW shows 90% reduction in bacterial load
22           Immune Response                                                NaN
23       Mechanism of Action                               Cell wall disruption
24                    Target                                          Cell wall
25       Combination Therapy                                               None
26          Other Activities                                                NaN
27                 Pubmed ID                                           26930596
28       Year of Publication                                               2016
29             3-D Structure                 View in Jmol or Download Structure
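
Note that the URL above uses details=antitb_1001, whereas the loop in the question requests details=1001. To fetch several entries the same way, a sketch along these lines might work. detail_url and fetch_details are hypothetical helper names, and it is an assumption that other entries follow the same antitb_<number> pattern and keep the details table last on the page.

```python
BASE_URL = "https://webs.iiitd.edu.in/raghava/antitbpdb/display.php"


def detail_url(number):
    """Build a detail-page URL in the form used in this answer."""
    return f"{BASE_URL}?details=antitb_{number}"


def fetch_details(numbers):
    """Return {number: DataFrame} holding the last table of each detail page."""
    # Imported here so that detail_url can be used without pandas installed.
    import pandas as pd

    return {n: pd.read_html(detail_url(n))[-1] for n in numbers}


# Show which URLs would be requested for the range used in the question.
for n in range(1001, 1003):
    print(detail_url(n))

# Needs network access, pandas and an HTML parser backend such as lxml:
# tables = fetch_details(range(1001, 1003))
```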

Upvotes: 2
