Suneetha Thentu

Reputation: 103

How to detect the language of webpage content using Python

I need to check whether a set of URLs serve content in the expected translated language. Is there a way in Python to return the language of a webpage's content? For example, if the page is in Chinese, it should return `"Chinese"`.

I tried the langdetect module but could not get the results I wanted. The pages at these URLs are in XML format, and the content appears under the `<releasehigh>` element.

Upvotes: 3

Views: 5480

Answers (4)

Eapen Jose

Reputation: 111

You can use BeautifulSoup to read the language that the page declares in its HTML source, for example:

<html class="no-js" lang="cs">

Extract the `lang` attribute from the source:

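from bs4 import BeautifulSoup
import requests

url = "https://example.com"  # placeholder: replace with the page to check

html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# .get() returns None instead of raising KeyError when lang is missing
print(soup.html.get("lang"))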

Upvotes: 2

jrc

Reputation: 21939

Here is a simple example that uses BeautifulSoup to extract the HTML body text and langdetect to detect its language:

from bs4 import BeautifulSoup
from langdetect import detect

with open("foo.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")

# Remove <script> elements so their code is not treated as page text
for script in soup("script"):
    script.decompose()

body_text = soup.body.get_text()
print(detect(body_text))

Upvotes: 6

duong_dajgja

Reputation: 4276

You can extract a chunk of the page content and then run it through a Python language-detection library such as langdetect or guess-language.
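
For the XML case in the question, that could look roughly like the sketch below. It assumes the text sits inside `<releasehigh>` elements with no XML namespace (a namespaced document would need the qualified tag name), and the helper names `extract_text` and `detect_language` are made up for illustration. The langdetect import is kept inside the function so that the extraction step works on its own.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_text, tag="releasehigh"):
    """Join the text found inside every <tag> element of the document."""
    root = ET.fromstring(xml_text)
    # itertext() also picks up text nested in child elements of <tag>
    chunks = ["".join(node.itertext()).strip() for node in root.iter(tag)]
    return " ".join(chunk for chunk in chunks if chunk)


def detect_language(text):
    """Return the language code guessed by langdetect, e.g. 'en' or 'zh-cn'."""
    from langdetect import detect  # third-party: pip install langdetect
    return detect(text)
```

Calling `detect_language(extract_text(xml_text))` then gives a language code rather than a name such as "Chinese", so a small lookup table from codes to names would still be needed. Very short chunks tend to give unreliable guesses, so it helps to pass as much text as is available.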

Upvotes: 4

Pierre.Sassoulas

Reputation: 4282

The page may have an opening tag like this one:

<HTML xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">

If so, `lang="fr"` tells you that this is a French web page. If not, guessing the language of a text is not trivial.
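
As a rough sketch of that check using only the standard library, the parser below records the `lang` (or `xml:lang`) attribute of the first `<html>` tag and returns `None` when the page declares nothing. The names `LangAttributeParser` and `declared_language` are hypothetical, chosen just for this example.

```python
from html.parser import HTMLParser


class LangAttributeParser(HTMLParser):
    """Records the lang (or xml:lang) attribute of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <HTML> matches too
        if tag == "html" and self.lang is None:
            attributes = dict(attrs)
            self.lang = attributes.get("lang") or attributes.get("xml:lang")


def declared_language(html_text):
    """Return the declared language code, or None if the page declares none."""
    parser = LangAttributeParser()
    parser.feed(html_text)
    return parser.lang
```

When `declared_language` returns `None`, or when the declared value cannot be trusted, the remaining option is to guess from the text itself with a statistical detector.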

Upvotes: 1
