Reputation: 1093
I noticed something odd when working with BeautifulSoup and couldn't find any documentation to support it, so I wanted to ask here.
Say we have tags like these that we have parsed with BS:
<td>Some Table Data</td>
<td></td>
The officially documented way to extract the data is soup.string. However, this returned None for the second <td> tag. So I tried soup.text (because why not?) and it returned an empty string, exactly as I wanted.
However, I couldn't find any reference to this in the documentation and am worried that something is amiss. Can anyone let me know whether this is acceptable to use, or will it cause problems later?
BTW I am scraping table data from a web page and mean to create CSVs from the data, so I do actually need empty strings rather than None.
Upvotes: 48
Views: 37357
Reputation: 9657
.string on a Tag object returns a NavigableString object. On the other hand, .text gets all the child strings and returns them concatenated using the given separator. The return type of .text is unicode (str in Python 3).
From the documentation: a NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree.
From the documentation on .string, we can see that if the HTML is like this,
<td>Some Table Data</td>
<td></td>
then .string on the second td will return None.
But .text will return an empty string, which is a unicode object.
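For the CSV use case in the question, here is a minimal sketch of how that might look. It assumes the bs4 package is installed and uses the built-in "html.parser"; the table markup and the variable names are made up for illustration. get_text() is the method form of .text.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup: one row with a filled cell and an empty cell.
html = "<table><tr><td>Some Table Data</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Collect one list of cell texts per table row.
# get_text() yields "" for the empty cell, never None.
rows = []
for tr in soup.find_all("tr"):
    rows.append([td.get_text() for td in tr.find_all("td")])

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_output = buffer.getvalue()
```

Every cell comes back as a plain string, so the rows can be handed to csv.writer without any special handling for empty cells.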
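For greater convenience, here is how the two properties are described.
string
A convenience property of a tag to get the single string within this tag.
If the tag has a single string child then the return value is that string.
If the tag has no children or more than one child then the return value is None.
If the tag has one child tag then the return value is the 'string' attribute of the child tag, recursively.
And text
Gets all the child strings and returns them concatenated using the given separator.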
If the HTML is like this:
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
.string on the four td elements will return:
some text
None
more text
None
.text will give results like this (the second one is an empty string):
some text

more text
even more text
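The comparison above can be reproduced with a short script. This is only a sketch, assuming bs4 is installed and the built-in "html.parser" is used; the surrounding table markup and the variable names are mine.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string: the single string if there is exactly one (found recursively), else None.
strings = [td.string for td in cells]

# get_text() / .text: all descendant strings concatenated, "" when there are none.
texts = [td.get_text() for td in cells]

for s, t in zip(strings, texts):
    print(repr(s), "|", repr(t))
```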
Upvotes: 81
Reputation: 131
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
Example:
<td>sometext<p>sometext</p></td>
For the markup above, td.string returns None because the td contains text as well as a p tag. But td.text gives: sometextsometext
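A quick way to check this yourself is sketched below, assuming bs4 with the built-in "html.parser"; the variable names are just for illustration. It also shows .strings and the separator argument of get_text(), which help when you want the pieces kept apart.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td>sometext<p>sometext</p></td>", "html.parser")
td = soup.td

# The td has two children (a string and a <p> tag), so .string is None.
single = td.string

# .strings yields each descendant string separately.
pieces = list(td.strings)

# get_text() joins them; the optional separator goes between the pieces.
joined = td.get_text()
spaced = td.get_text(" ")

print(single, pieces, joined, spaced)
```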
Upvotes: 7
Reputation:
The element <td></td> does not contain an empty string. It is equivalent to <td/>, which has no children. For XML, "no text" and "zero-length text" are the same.
So soup.string is correct to return None.
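The "no child" point can be seen directly in the parsed tree. A small sketch, assuming bs4 with the built-in "html.parser" (variable names are mine):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td></td>", "html.parser")
td = soup.td

# An empty element has no child nodes at all, not a zero-length string node.
children = td.contents

string_value = td.string    # no single string to return
text_value = td.get_text()  # concatenation of zero strings

print(children, string_value, repr(text_value))
```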
See also How to create an XML text node with an empty string value (in Java)
Upvotes: 0