Reputation: 61
I'm trying to collect data for my lab from this website: link
Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title)
I expect title to be كابستون علوم البيانات التطبيقية, but the result is منهجية علم البيانات.
What is the problem, and how do I fix it?
Thank you for taking the time to answer.
Upvotes: 3
Views: 185
Reputation: 3036
The issue you are facing is due to improper encoding when fetching the URL with requests.get(). When a response does not declare a charset, the requests library falls back to a default encoding of ISO-8859-1, which results in the HTML text being decoded incorrectly. To force a proper encoding for the requested page, set the encoding attribute of the response before reading its text. For this to work, the line requests.get(url).text has to be split up like so:
...
# Request the URL and store the request
request = requests.get(url)
# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding
# Now extract the HTML as text
html = request.text
...
In the above code snippet, request.apparent_encoding infers the encoding from the response content itself, so you do not have to specify a particular encoding by hand.
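If you want to check what is going on before fixing it, you can compare the two attributes directly. This is just a diagnostic sketch; the exact values depend on the headers the server sends back:
import requests

url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)

# Encoding taken from the Content-Type header (requests falls back to
# ISO-8859-1 when the header declares no charset) vs. the encoding
# detected from the page content itself.
print(request.encoding)           # e.g. ISO-8859-1 when no charset is declared
print(request.apparent_encoding)  # e.g. utf-8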
So, the final code would be as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text
soup = BeautifulSoup(html, 'lxml')
info = soup.find('div', class_='_1wb6qi0n')
title = info.find('h1', class_='banner-title banner-title-without--subtitle m-b-0')
print(title.text)
PS: You must use title.text when printing so that you print the tag's inner text rather than the tag itself.
Output:
كابستون علوم البيانات التطبيقية
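To see what the PS means in isolation, here is a minimal sketch using a made-up HTML snippet rather than the Coursera page:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1 class="banner-title">كابستون علوم البيانات التطبيقية</h1>', 'lxml')
title = soup.find('h1', class_='banner-title')

print(title)       # prints the whole tag, e.g. <h1 class="banner-title">...</h1>
print(title.text)  # prints only the inner text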
Upvotes: 4
Reputation: 184
What was causing the error is the encoding of the HTML data.
Arabic letters take two bytes each in UTF-8, so they are garbled when the bytes are decoded with a single-byte encoding.
You need to set the encoding of the HTML data to UTF-8:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url)
html.encoding = html.apparent_encoding
soup=BeautifulSoup(html.text,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').get_text()
print(title)
In the code above, apparent_encoding automatically sets the encoding to whatever suits the data.
Output:
كابستون علوم البيانات التطبيقية
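To see why a single-byte decode mangles the Arabic text, here is a small sketch that reproduces the mojibake by hand, with no network access involved:
# The course title as Unicode text
title = 'كابستون علوم البيانات التطبيقية'

raw = title.encode('utf-8')        # every Arabic letter becomes 2 bytes here
broken = raw.decode('iso-8859-1')  # a single-byte decode turns each byte into a separate character
print(broken)                      # unreadable mojibake

# Reversing the wrong decode recovers the original text
print(broken.encode('iso-8859-1').decode('utf-8') == title)  # True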
Upvotes: 2
Reputation: 4510
There is a nice library called ftfy ("fixes text for you"). It has support for multiple languages.
Installation: pip install ftfy
Try this:
from bs4 import BeautifulSoup
import ftfy
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)
print(title)
Output:
كابستون علوم البيانات التطبيقية
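For context, ftfy can also repair a string that has already been decoded with the wrong codec. The sketch below assumes the mojibake came from UTF-8 bytes being decoded as Latin-1, which is the kind of mix-up ftfy is designed to detect and reverse:
import ftfy

# Manually reproduce the kind of mojibake the scraper returns
broken = 'كابستون علوم البيانات التطبيقية'.encode('utf-8').decode('latin-1')

print(broken)                 # unreadable
print(ftfy.fix_text(broken))  # ftfy detects the bad decode and reverses it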
Upvotes: 0
Reputation: 43
I think you need to use UTF-8 encoding/decoding. If the problem is only in how your terminal displays the text, there may be no fix for that, but if you show the result in another environment, such as a web page, you will see it correctly.
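If the terminal is indeed the limitation, one workaround is to write the result into a UTF-8 encoded file and open it in a browser. A minimal sketch (the output.html file name is just an example):
# Hypothetical example: save the scraped title into a UTF-8 HTML file
title = 'كابستون علوم البيانات التطبيقية'

with open('output.html', 'w', encoding='utf-8') as f:
    f.write('<meta charset="utf-8">\n')
    f.write('<h1>' + title + '</h1>\n')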
Upvotes: -1