dxb625

Reputation: 63

Web scraping most frequent names

I need to scrape a web page and find the five most frequent names. The expected output should look like this:

[
    ('Anna Pavlovna', 7), 
    ('the prince', 7), 
    ('the Empress', 3), 
    ('The prince', 3),
    ('Prince Vasili', 2),
]

My code does count the most frequent names, but the output looks like this instead:

 [(<span class="green">Anna Pavlovna</span>, 7),
 (<span class="green">the prince</span>, 7),
 (<span class="green">the Empress</span>, 3),
 (<span class="green">The prince</span>, 3),
 (<span class="green">Prince Vasili</span>, 2)]

What can I do to make my output look like the sample output?

import nltk

from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
soup=BeautifulSoup(html,'html.parser')

nameList = soup.findAll("span", {"class":"green"})  # may use bsObj.find_all()


fdist1 = nltk.FreqDist(nameList)
fdist1.most_common(5)

Upvotes: 2

Views: 126

Answers (2)

Mohamed Benkedadra

Reputation: 2084

Just change this line:

nameList = soup.findAll("span", {"class":"green"})

to this:

nameList = [tag.text for tag in soup.findAll("span", {"class":"green"})]

The findAll function returns a list of Tag objects, not strings. To get the text inside each tag, use its text property.
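To see the whole fix end to end, here is a minimal sketch. It assumes an inline HTML snippet (SAMPLE_HTML, made up for illustration) in place of the real page, and it uses collections.Counter for the counting step instead of nltk.FreqDist; FreqDist is a Counter subclass, so most_common(5) should behave the same way.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical stand-in for the War and Peace page, so the example
# runs without a network connection.
SAMPLE_HTML = """
<p><span class="green">Anna Pavlovna</span> spoke to
<span class="green">the prince</span>.
<span class="green">Anna Pavlovna</span> smiled at
<span class="green">Prince Vasili</span>.
<span class="red">"Well, Prince"</span></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Extract the text of each green span so we count plain strings, not Tag objects.
names = [tag.text for tag in soup.find_all("span", {"class": "green"})]

# Count occurrences and keep the five most common names.
counts = Counter(names)
top = counts.most_common(5)
print(top)
```

Because the list now holds plain strings, the printed tuples show the names themselves rather than the surrounding span markup.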

Upvotes: 1

skywalker

Reputation: 1300

The page currently returns a 502 Bad Gateway error, but I think I know what your problem is. findAll gives you bs4 elements (Tag objects) instead of strings, so you need to convert each one to a string with something like obj.get_text(). See the documentation.

items = soup.findAll("span", {"class": "green"})
texts = [item.get_text() for item in items]
# Now you have the texts of the span elements
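The difference between a bs4 element and its text can be checked directly. This is a small sketch on a one-line HTML string invented for illustration; the strip=True argument to get_text() is an optional extra, not something the original code needs.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-element document, just to inspect the types involved.
soup = BeautifulSoup('<span class="green"> Anna Pavlovna </span>', "html.parser")
item = soup.find("span", {"class": "green"})

# find/find_all hand back Tag objects, which print as their full markup.
is_tag = isinstance(item, Tag)
print(type(item).__name__, item)

# get_text() returns a plain str containing only the inner text.
text = item.get_text()
print(repr(text))

# Optionally strip surrounding whitespace while extracting.
stripped = item.get_text(strip=True)
print(repr(stripped))
```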

BTW, the comment in your code sample is incorrect: bsObj is never defined, since your variable is called soup.

Upvotes: 1
