user2051904

Reputation: 73

Web scraping with Python - output has extra characters

I'm new to Python and am learning web scraping through a Udemy course. I'm trying to scrape some output from a demo site, and although I'm able to get the results, the output contains escape sequences that I'm unable to convert into regular text.

#!/usr/bin/env python3.6
'''
webscraping html, webpage data.
1. get the authors of quotes on first page.
2. create a list of all the quotes on first page.
3. extract the top ten tags on the home page.
'''
import urllib.request
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com/'
with urllib.request.urlopen(base_url) as response:
    html = response.read()
    text = str(html)
    soup = BeautifulSoup(text, 'lxml')

def get_author():
    authors = set()
    for name in soup.select('.author'):
        authors.add(name.text)
    print(authors)

def get_quotes():
    quotes = []
    for quote in soup.select('.text'):
        quotes.append(quote.text)
    print(quotes)

def top_ten_tags():
    toptags = []
    for tags in soup.select('.tag-item'):
        toptags.append(tags.text)
    print(toptags)


get_author()
get_quotes()
top_ten_tags()

Output:

{'Albert Einstein', 'Jane Austen', 'Thomas A. Edison', 'J.K. Rowling', 'Andr\\xc3\\xa9 Gide', 'Steve Martin', 'Eleanor Roosevelt', 'Marilyn Monroe'}
['\\xe2\\x80\\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\\xe2\\x80\\x9d', '\\xe2\\x80\\x9cIt is our choices, Harry, that show what we truly are, far more than our abilities.\\xe2\\x80\\x9d', '\\xe2\\x80\\x9cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\\xe2\\x80\\x9d', '\\xe2\\x80\\x9cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\\xe2\\x80\\x9d', "\\xe2\\x80\\x9cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\\xe2\\x80\\x9d", '\\xe2\\x80\\x9cTry not to become a man of success. Rather become a man of value.\\xe2\\x80\\x9d', '\\xe2\\x80\\x9cIt is better to be hated for what you are than to be loved for what you are not.\\xe2\\x80\\x9d', "\\xe2\\x80\\x9cI have not failed. I've just found 10,000 ways that won't work.\\xe2\\x80\\x9d", "\\xe2\\x80\\x9cA woman is like a tea bag; you never know how strong it is until it's in hot water.\\xe2\\x80\\x9d", '\\xe2\\x80\\x9cA day without sunshine is like, you know, night.\\xe2\\x80\\x9d']
['\\n            love\\n            ', '\\n            inspirational\\n            ', '\\n            life\\n            ', '\\n            humor\\n            ', '\\n            books\\n            ', '\\n            reading\\n            ', '\\n            friendship\\n            ', '\\n            friends\\n            ', '\\n            truth\\n            ', '\\n            simile\\n            ']

As you can see, the authors set should contain the name "André Gide" with an acute accent on the e, but for some reason Python is not printing it. The second list, the quotes, contains escape sequences that I don't understand. Can anyone please tell me what I'm doing wrong here?

Upvotes: 0

Views: 295

Answers (1)

DYZ

Reputation: 57033

Your problem is that str(html) forcefully converts the raw bytes of the HTML into their printable representation instead of properly decoding them. Decode the bytes as UTF-8 instead:

text = html.decode("utf8")
soup = BeautifulSoup(text, 'lxml')
get_author()
#{... 'André Gide', 'Eleanor Roosevelt'...}
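
To see the difference without fetching the page, here is a minimal sketch that runs on a made-up byte string. The sample text and the variable names are assumptions for illustration, not data from the site; only the behavior of str() and bytes.decode() is the point.

```python
# Hypothetical sample standing in for the bytes returned by response.read().
sample = "Andr\u00e9 Gide: \u201cHello\u201d"
raw = sample.encode("utf-8")

# str() on a bytes object returns its repr: a "b'...'" wrapper in which
# every non-ASCII byte is spelled out as a literal backslash escape.
as_repr = str(raw)

# decode() interprets the bytes as UTF-8 and returns the real characters.
as_text = raw.decode("utf-8")

print(as_repr)            # ASCII only, contains \xc3\xa9 and \xe2\x80\x9c
print(as_text == sample)  # True: decoding recovers the original text
```

The escaped sequences in the question's output (\xc3\xa9 for the accented e, \xe2\x80\x9c and \xe2\x80\x9d for the curly quotes) are the UTF-8 bytes of those characters, which is why decoding makes them disappear.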

Upvotes: 1
