Reputation: 163
I am very new to web scraping in Python. My code runs without errors and the output looks correct, except for the language it comes back in. I tried my hand with IMDb, the popular movie website. After inspecting the HTML, I want to extract the name of each movie, its rating, etc. This is the IMDb page listing the top 250 movies and their ratings: https://www.imdb.com/chart/top/ My code to scrape the data is as follows; I use the BeautifulSoup and requests modules.
import requests
from bs4 import BeautifulSoup

# Use the requests module to access the IMDb website
source = requests.get('https://www.imdb.com/chart/top/')
# Raise an exception if the request failed (e.g. a 4xx or 5xx status)
source.raise_for_status()
# Parse the returned HTML with Python's built-in HTML parser
soup = BeautifulSoup(source.text, 'html.parser')
movies = soup.find('tbody', class_='lister-list').find_all('tr')
# print(len(movies))
# Iterate through each tr tag
for movie in movies:
    # Use break to check only the first element of the list
    # break
    name = movie.find('td', class_='titleColumn').a.text
    rank = movie.find('td', class_='titleColumn').get_text(strip=True).split('.')[0]
    year = movie.find('td', class_='titleColumn').span.text.strip('()')
    rating = movie.find('td', class_='ratingColumn imdbRating').strong.text
    print(name, rank, year, rating)
Everything on the website is in English, so why is my output in a foreign language?
The output is the following:
刺激1995 1 1994 9.2
教父 2 1972 9.2
黑暗騎士 3 2008 9.0
教父第二集 4 1974 9.0
十二怒漢 5 1957 8.9
辛德勒的名單 6 1993 8.9
魔戒三部曲:王者再臨 7 2003 8.9
黑色追緝令 8 1994 8.9
魔戒首部曲:魔戒現身 9 2001 8.8
黃昏三鏢客 10 1966 8.8
阿甘正傳 11 1994 8.8
鬥陣俱樂部 12 1999 8.7
全面啟動 13 2010 8.7
魔戒二部曲:雙城奇謀 14 2002 8.7
星際大戰五部曲:帝國大反擊 15 1980 8.7
駭客任務 16 1999 8.7
四海好傢伙 17 1990 8.7
飛越杜鵑窩 18 1975 8.6
火線追緝令 19 1995 8.6
七武士 20 1954 8.6
風雲人物 21 1946 8.6
沉默的羔羊 22 1991 8.6
Upvotes: 0
Views: 1041
Reputation: 837
You can add Accept-Language to your headers before making the request.
headers = {'Accept-Language': 'en-US,en;q=0.5'}
source = requests.get('https://www.imdb.com/chart/top/', headers=headers)
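As a rough illustration of what the value 'en-US,en;q=0.5' means: it is a comma-separated list of language tags, each with an optional q weight between 0 and 1, and a tag with no q counts as 1. The sketch below uses a hypothetical helper, parse_accept_language (my own name, not part of requests or the standard library), and only handles this simple form of the header, so treat it as a reading aid rather than a full parser.

```python
def parse_accept_language(value):
    """Return (language_tag, weight) pairs sorted by descending weight.

    Hypothetical helper for illustration only; it handles the simple
    "tag;q=weight" form and nothing more exotic.
    """
    preferences = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Each item is a language tag, optionally followed by ";q=<weight>".
        tag, _, params = item.partition(";")
        weight = 1.0  # a tag without an explicit q-value has weight 1.0
        params = params.strip()
        if params.startswith("q="):
            weight = float(params[2:])
        preferences.append((tag.strip(), weight))
    # sorted() is stable, so tags with equal weight keep their original order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


headers = {"Accept-Language": "en-US,en;q=0.5"}
print(parse_accept_language(headers["Accept-Language"]))
# → [('en-US', 1.0), ('en', 0.5)]
```

So the header above asks for US English first and any other English variant as a second choice.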
Accept-Language is an HTTP request header that indicates the language and locale the client prefers (according to the Accept-Language MDN docs). By adding this header, you tell the server that you want the response in the English (US) language and locale. If the server supports that language and honors the header, you will get what you need.
headers is a set of key-value pairs, which Python requests lets you define with a Python dict. It is optional, and you can add it by following this documentation: Python Requests - Custom Headers
Upvotes: 4
Reputation: 136
I assume that your IP address is located in China? There is a chance that IMDb uses geolocation and sets your language to Chinese.
You have the same problem as this person, and I think the same answer applies: add a header to your request and set the language to English.
Python change Accept-Language using requests
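A minimal sketch of one way to do that, assuming you plan to make several requests and want the header on all of them: set it once on a requests.Session. The request here is only prepared, not sent, so you can inspect the outgoing header without contacting IMDb.

```python
import requests

# Headers set on a Session are merged into every request made through it.
session = requests.Session()
session.headers.update({"Accept-Language": "en-US,en;q=0.5"})

# Build the request without sending it, to inspect the outgoing headers.
request = requests.Request("GET", "https://www.imdb.com/chart/top/")
prepared = session.prepare_request(request)
print(prepared.headers["Accept-Language"])

# To actually fetch the page, call session.get(...) with the same URL.
```

Whether the server honors the header is up to the site, so it is worth checking the response text after the real request.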
Upvotes: 1