Nguyen Thai

Reputation: 61

How to extract a unicode text inside a tag?

I'm trying to collect data for my lab from this website: link

Here is my code:

from bs4 import BeautifulSoup
import requests


url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')

info=soup.find('div',class_='_1wb6qi0n')

title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')

print(title)

I expect title to be كابستون علوم البيانات التطبيقية, but the result is منهجية علم البيانات.

What is the problem? And how do I fix it?

Thank you for taking time to answer.

Upvotes: 3

Views: 185

Answers (4)

Melvin Abraham

Reputation: 3036

The issue you are facing is caused by the wrong encoding being applied when the URL is fetched with requests.get(). When the response headers do not declare a charset for a text response, requests falls back to ISO-8859-1, which appears to be what happens here, so the bytes of the page are decoded incorrectly. To force a proper encoding, set the encoding attribute of the response before reading its text. For this to work, the line requests.get(url).text has to be split up like so:

...

# Request the URL and store the response
request = requests.get(url)

# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding

# Now extract the HTML as text
html = request.text

...

In the snippet above, request.apparent_encoding infers the encoding from the content of the response, so you do not have to hard-code a specific encoding.
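To see why the wrong encoding garbles the title, here is a minimal offline sketch. It does not call requests at all; it only assumes that the page bytes are UTF-8 and that they get decoded as ISO-8859-1, and the variable names are made up for illustration.

```python
# Simulate decoding UTF-8 page bytes with the wrong codec (no network needed).
expected = "كابستون علوم البيانات التطبيقية"

# The bytes a server would send for this text on a UTF-8 page.
raw = expected.encode("utf-8")

# ISO-8859-1 maps every byte to some character, so this never raises,
# but the result is mojibake rather than Arabic text.
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 recovers the original text.
recovered = raw.decode("utf-8")

print(garbled == expected)    # False
print(recovered == expected)  # True
```

Setting the encoding attribute before reading .text is what switches the decoding step from the first case to the second.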

So, the final code would be as follows:

from bs4 import BeautifulSoup
import requests

url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'

request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text

soup = BeautifulSoup(html,'lxml')
info = soup.find('div',class_='_1wb6qi0n')
title = info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')

print(title.text)

PS: Print title.text rather than title to get the inner content of the tag instead of the whole element.

Output:

كابستون علوم البيانات التطبيقية

Upvotes: 4

ELAi

Reputation: 184

The error is caused by the encoding of the HTML data.

In UTF-8, each Arabic letter is encoded as 2 bytes.

You need the HTML data to be decoded as UTF-8:

from bs4 import BeautifulSoup
import requests


url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url)
html.encoding = html.apparent_encoding
soup=BeautifulSoup(html.text,'lxml')

info=soup.find('div',class_='_1wb6qi0n')

title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').get_text()

print(title)

In the code above, apparent_encoding is the encoding detected from the response content; assigning it to html.encoding makes html.text decode the data with it.

Output:

كابستون علوم البيانات التطبيقية
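The two-byte point above can be checked directly. This is only a small sketch using the expected title as sample data; the variable names are mine.

```python
title = "كابستون علوم البيانات التطبيقية"
kaf = "\u0643"  # the Arabic letter kaf, the first letter of the title

# One Arabic letter takes two bytes in UTF-8, while a space takes one.
print(len(kaf.encode("utf-8")))  # 2
print(len(" ".encode("utf-8")))  # 1

# So the encoded title is longer in bytes than it is in characters.
encoded = title.encode("utf-8")
print(len(title), len(encoded))
```

A single-byte codec such as ISO-8859-1 turns each of those bytes into its own character, which is why the garbled title comes out longer than the real one.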

Upvotes: 2

Sabil

Reputation: 4510

There is a nice library called ftfy that repairs text which was decoded with the wrong encoding. It supports multiple languages.

Installation: pip install ftfy

Try this:

from bs4 import BeautifulSoup
import ftfy

import requests


url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')

info=soup.find('div',class_='_1wb6qi0n')

title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)

print(title)

Output:

كابستون علوم البيانات التطبيقية

Upvotes: 0

ali.k.mirzaei

Reputation: 43

I think you need to use UTF-8 encoding/decoding. If the problem is in your terminal, I think there is no solution, but if the result is displayed in another environment, such as a web page, it will show up correctly.
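If you suspect the terminal rather than the download, one way to check is to ask Python which encoding stdout uses. This is a rough sketch, not a guaranteed fix: can_encode is a hypothetical helper, and whether the terminal font can actually draw the glyphs is a separate matter that Python cannot detect.

```python
import sys


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        return False
    return True


title = "كابستون علوم البيانات التطبيقية"

print(can_encode(title, "utf-8"))  # True
print(can_encode(title, "ascii"))  # False

# Encoding Python will use when printing; may be missing on replaced streams.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
print(stdout_encoding)

# On Python 3.7+, a text stdout can be switched to UTF-8 if needed.
if not can_encode(title, stdout_encoding) and hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
```

If stdout already reports UTF-8 and the text still looks wrong, the decoding step is the more likely culprit.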

Upvotes: -1
