Web scraping most frequent names

Question

I need to scrape a web page and find the five most frequent names. The expected output should look like this:

[
    ('Anna Pavlovna', 7), 
    ('the prince', 7), 
    ('the Empress', 3), 
    ('The prince', 3), 
    ('Prince Vasili', 2),
]

My code does count the most frequent names but the output looks like this instead:

 [(Anna Pavlovna, 7),
 (the prince, 7),
 (the Empress, 3),
 (The prince, 3),
 (Prince Vasili, 2)]

What can I do to make my output look like the sample output?

import nltk

from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
soup=BeautifulSoup(html,'html.parser')

nameList = soup.findAll("span", {"class":"green"})  # may use bsObj.find_all()


fdist1 = nltk.FreqDist(nameList)
fdist1.most_common(5)

skywalker · Accepted Answer

The page returns a 502 Bad Gateway error for me, but I think I know what your problem is. findAll gives you bs4 elements instead of strings, so you need to convert each element to a string with something like obj.get_text() (see the documentation).

items = soup.findAll("span", {"class": "green"})
texts = [item.get_text() for item in items]
# Now you have the texts of the span elements

BTW, the comment in your code sample is incorrect: bsObj is never defined, so it would have to be soup.find_all().
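
Putting it together, here is a minimal sketch of the whole count once the elements are converted to strings. It swaps nltk.FreqDist for collections.Counter from the standard library (FreqDist is a Counter subclass, so most_common works the same way). The helper names span_texts and most_frequent are my own, not part of bs4 or nltk, and collapsing whitespace is an assumption in case a name is split across lines in the HTML.

```python
from collections import Counter


def span_texts(markup, css_class="green"):
    """Return the text of every <span class=css_class> as plain strings.

    `markup` can be an HTML string or the response object from urlopen().
    """
    # Imported here so most_frequent() below can be used without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(markup, "html.parser")
    return [tag.get_text() for tag in soup.find_all("span", {"class": css_class})]


def most_frequent(texts, n=5):
    """Return the n most common strings as (text, count) tuples."""
    # Assumption: runs of whitespace (including newlines) inside a name
    # should count as a single space, so "Anna\nPavlovna" == "Anna Pavlovna".
    cleaned = [" ".join(text.split()) for text in texts]
    return Counter(cleaned).most_common(n)
```

With these, most_frequent(span_texts(html)) should return a list of (name, count) tuples whose first items are plain strings, so they print with quotes like your expected output. Counting stays case-sensitive, which is why 'the prince' and 'The prince' remain separate entries.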
