Reputation: 54100
I'm doing web scraping with Beautiful Soup, and I'm new to it.
Question 1: Here is the Table:
<table width="75%" align=center>
<tr>
<td><STRONG><font face="Arial" size=2>S.No:</font></STRONG></td>
<td><font face="Arial" size=2> 1635925</font></td>
</tr>
<tr>
<td><FONT size=2><STRONG><font face="Arial">Name:</font><br></STRONG></FONT></td>
<td><font face="Arial" size=2> <b>Alex</b></font></td>
</tr>
<tr>
<td><STRONG><font face="Arial" size=2>Dog's Name:</font></STRONG></td>
<td><font face="Arial" size=2> Tiger</font></td>
</tr>
<tr>
<td><STRONG><font face="Arial" size=2 >Cat's Name:</font></STRONG></td>
<td><font face="Arial" size=2>Pussy</font></td>
</tr>
</table>
Here is the code that reads the table above:
for row in soup('table')[4]('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
Here is the output:
S.No: 1635925
None None
Dog's Name: Tiger
Cat's Name: Pussy
The problem is row 2: why did both columns print None?
Question 2: A similar problem to the one above:
<tr bgcolor="#ffffff">
<td align="middle"><font face="Arial" size=2>503</font></td>
<td align="left"><font face="Arial" size=2>Text1</font></td>
<td align="left"><font face="Arial" size=2>---</font></td>
<td align="middle"><font face="Arial" size=2>2</font></td>
</tr>
<tr bgcolor="#e6e6fa">
<td colspan=4><font face="Arial" size=2> some random text</font></td>
</tr>
<tr >
<td align="middle"><font face="Arial" size=2>048</font> </td>
<td align="left"><font face="Arial" size=2>Text 2</font></td>
<td align="left"><font face="Arial" size=2>187 </font></td>
<td align="middle"><font face="Arial" size=2>2</font></td>
</tr>
My code:
for row in soup('table')[5]('tr'):
    tds = row('td')
    if len(tds) == 4:
        print tds[0].string, tds[1].string, tds[2].string, tds[3].string
Output:
503 Text1 --- 2
None Text2 187 2
Why is the text of the first column None and not 048?
Upvotes: 1
Views: 135
Reputation: 365717
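The problem is that the second row's td elements don't reduce to a single string. Somewhere inside each one, a tag has more than one child: in the first cell the strong tag holds both a font tag and a br tag, and in the second cell the font tag holds a whitespace string followed by a b tag. So .string doesn't have an unambiguous value, and therefore returns None.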
You can see this if you break it down into pieces:
>>> table = soup('table')[4]
>>> row = table('tr')[1]
>>> col = row('td')[0]
>>> font = col('font')[0]
>>> strong = font('strong')[0]
>>> font2 = strong('font')[0]
>>> strong
<strong><font face="Arial">Name:</font><br/></strong>
>>> strong.string
>>> font2
<font face="Arial">Name:</font>
>>> font2.string
u'Name:'
If you want the textual representation of all of the strings within an element, use .text instead of .string:
>>> strong.text
u'Name:'
>>> font.text
u'Name:'
>>> col.text
u'Name:'
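For anyone who wants to reproduce this outside the REPL, here is a minimal sketch of the same comparison. It assumes Python 3 with the bs4 package and the built-in "html.parser" backend (the question itself uses Python 2), and it rebuilds only row 2 of the table from the question, so the variable names are illustrative rather than taken from the original page.

```python
from bs4 import BeautifulSoup

# Row 2 of the table from Question 1, copied from the post.
html = """
<table>
<tr>
<td><FONT size=2><STRONG><font face="Arial">Name:</font><br></STRONG></FONT></td>
<td><font face="Arial" size=2> <b>Alex</b></font></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
label_td, value_td = soup.find("tr").find_all("td")

# <strong> holds two children (<font> and <br>), so there is no single
# string for .string to return.
strong = label_td.find("strong")
print(len(strong.contents))

# Both cells have an ambiguous descendant, so .string is None for each.
print(label_td.string, value_td.string)

# get_text() joins every string under the tag; strip=True trims each piece
# and drops pieces that are only whitespace.
label = label_td.get_text(strip=True)
value = value_td.get_text(strip=True)
print(label, value)
```

Passing strip=True is a choice made here to drop the stray space before Alex; plain .text would keep it.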
Upvotes: 1
Reputation: 473873
Try .text instead of .string. For example:
for row in soup('table')[4]('tr'):
    tds = row('td')
    print tds[0].text, tds[1].text
prints:
S.No: 1635925
Name: Alex
Dog's Name: Tiger
Cat's Name: Pussy
According to the docs, .string is None if the element has more than one child:
For your convenience, if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].
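The same rule appears to explain Question 2: in the third row, the first td contains the font tag plus a trailing space, so it has two children and .string is None. Below is a rough sketch of one way to handle that table, assuming Python 3 with bs4 and the built-in "html.parser" backend. The helper name cell_texts is made up for this example, and the markup is the fragment from the question wrapped in a table tag.

```python
from bs4 import BeautifulSoup

# The rows from Question 2, copied from the post.
html = """
<table>
<tr bgcolor="#ffffff">
<td align="middle"><font face="Arial" size=2>503</font></td>
<td align="left"><font face="Arial" size=2>Text1</font></td>
<td align="left"><font face="Arial" size=2>---</font></td>
<td align="middle"><font face="Arial" size=2>2</font></td>
</tr>
<tr bgcolor="#e6e6fa">
<td colspan=4><font face="Arial" size=2> some random text</font></td>
</tr>
<tr >
<td align="middle"><font face="Arial" size=2>048</font> </td>
<td align="left"><font face="Arial" size=2>Text 2</font></td>
<td align="left"><font face="Arial" size=2>187 </font></td>
<td align="middle"><font face="Arial" size=2>2</font></td>
</tr>
</table>
"""


def cell_texts(row):
    """Return the stripped text of every <td> in a row (hypothetical helper)."""
    return [td.get_text(strip=True) for td in row.find_all("td")]


soup = BeautifulSoup(html, "html.parser")

# The cell holding 048 has two children: the <font> tag and a " " string.
first_cell = soup.find_all("tr")[2].find("td")
print(len(first_cell.contents), first_cell.string)

# Keep only the four-column rows, as the original loop does.
data_rows = [cells for cells in map(cell_texts, soup.find_all("tr")) if len(cells) == 4]
for cells in data_rows:
    print(" ".join(cells))
```

Stripping each cell also removes the trailing space after 187, which .string would have kept.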
Upvotes: 1