Reputation: 51
I am learning data science and, while working on a problem, I came across a weird observation. The problem was to print the number of occurrences of the string 'Soup' on the Beautiful Soup home page using Python. The weird part is that the number of occurrences differs between the IPython notebook and plain Python, and when I ran a manual search on the web page the result was different again.
I'd love it if someone could give a plausible explanation. I have attached the code snippets and the results:
In Python
In Pandas
Manually
As you can see, the result differs in every environment: 39 occurrences in Python, 41 in Pandas, and 35 via manual search.
Thanks
Upvotes: 1
Views: 55
Reputation: 863481
I think Python found only 39, because the 2 missing are in <head>:
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<meta name="Description" content="Beautiful Soup: a library designed for screen-scraping HTML and XML.">
You can check this by viewing the source of the page: there are 41 occurrences.
If you check the web page manually (35 occurrences), 4 more are in URLs and 2 are in <head>, so together that makes 41:
<a href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Here's
the Beautiful Soup 3 documentation.</a>
<a href="download/3.x/BeautifulSoup-3.2.1.tar.gz">3.2.1</a>
<a href="/source/software/BeautifulSoup/index.bhtml">
<a href="http://www.crummy.com/software/BeautifulSoup/">
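To see where each match comes from, here is a minimal sketch that splits the count by location. It assumes a tiny made-up page (`SAMPLE_HTML`, not the real Beautiful Soup home page) and uses the standard library's `html.parser` instead of Beautiful Soup so it runs without installing anything. The names `SoupCounter` and `count_by_location` are hypothetical helpers for illustration only.

```python
from html.parser import HTMLParser

# Made-up miniature page for illustration; NOT the real Beautiful Soup home page.
SAMPLE_HTML = """<html>
<head>
<title>Beautiful Soup: a sample title</title>
<meta name="Description" content="Beautiful Soup: a sample description">
</head>
<body>
<p>Beautiful Soup parses HTML.</p>
<a href="download/BeautifulSoup-3.2.1.tar.gz">3.2.1</a>
<a href="/software/BeautifulSoup/">Beautiful Soup home</a>
</body>
</html>"""


class SoupCounter(HTMLParser):
    """Tally occurrences of a word by where it appears in the markup."""

    def __init__(self, word):
        super().__init__()
        self.word = word
        self.in_head = False
        self.counts = {"head": 0, "attributes": 0, "visible": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        # Attribute values (href, content, ...) are not shown as page text.
        hits = sum(value.count(self.word) for _, value in attrs if value)
        self.counts["head" if self.in_head else "attributes"] += hits

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        # Text inside <head> (e.g. <title>) is not part of the rendered body.
        key = "head" if self.in_head else "visible"
        self.counts[key] += data.count(self.word)


def count_by_location(html, word):
    """Return a dict of counts for head, body attributes, and visible text."""
    parser = SoupCounter(word)
    parser.feed(html)
    parser.close()
    return parser.counts


raw_total = SAMPLE_HTML.count("Soup")           # what a search of the raw source sees
counts = count_by_location(SAMPLE_HTML, "Soup")
print("raw source:", raw_total)
print("by location:", counts)
```

On this sample the three buckets add up to the raw-source total, which is the same bookkeeping as above: a manual search on the rendered page only sees the "visible" bucket, while a count over the raw HTML also picks up matches in `<head>` and in URLs.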
Upvotes: 3