Reputation: 196
I am trying to extract text from this website: searchgurbani. It has old scripture translated line by line into English and Punjabi (an Indian language), which makes it a very good parallel corpus. I have successfully extracted all the English translations into a separate text file, but when I try the same for Punjabi, nothing is returned.
This is the Inspect element screenshot (the highlighted text is the Punjabi translation):
In Screenshot 1, the highlighted text, which belongs to class=lang_16, does not appear in the soup object beautiful, which should contain all of the HTML. Here is the Python code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

outputFilePunjabi = open("1.txt", "w", newline="", encoding="utf-16")
r = urlopen("")
beautiful = BeautifulSoup(r.read().decode('utf-8'), "html5lib")
#beautiful = BeautifulSoup(r.read().decode('utf-8'), "lxml")
punjabi_text = beautiful.find_all(class_="lang_16")
# Write each matching element's text on its own line
for i in punjabi_text:
    outputFilePunjabi.write(i.get_text())
    outputFilePunjabi.write('\n')
outputFilePunjabi.close()
If I run the same code with class_=lang_4, it works.
To see lang_16 in Inspect element, please do the following on that web page: Go to preferences --> Tick "translation of Sri Guru Granth Sahib ji (by S. Manmohan Singh) - Punjabi" under Additional Translations available on Guru Granth Shahib: --> scroll down --> submit changes --> reopen page
Please tell me where I am going wrong.
(python version = 3.5)
PS: I have very little experience in web scraping.
Upvotes: 2
Views: 297
Reputation: 474241
Remember that you suggested doing the following:
Please do the following on that web page: Go to preferences -> Tick "translation of Sri Guru Granth Sahib ji (by S. Manmohan Singh) - Punjabi" under Additional Translations available on Guru Granth Shahib: -> scroll down - submit changes
Now, this is also required when you download the page in Python. In other words, use requests and set the lang_16="yes" cookie to enable the Punjabi translation:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    # Send the preference cookie so the server includes the Punjabi translation
    response = session.get("https://www.searchgurbani.com/guru_granth_sahib/ang_by_ang", cookies={
        "lang_16": "yes"
    })
    soup = BeautifulSoup(response.content, "html5lib")
    for item in soup.select(".lang_16"):
        print(item.get_text())
Prints:
ਵਾਹਿਗੁਰੂ ਕੇਵਲ ਇਕ ਹੈ। ਸੱਚਾ ਹੈ ਉਸ ਦਾ ਨਾਮ, ਰਚਨਹਾਰ ਉਸ ਦੀ ਵਿਅਕਤੀ ਅਤੇ ਅਮਰ ਉਸ ਦਾ ਸਰੂਪ। ਉਹ ਨਿਡਰ, ਕੀਨਾ-ਰਹਿਤ, ਅਜਨਮਾ ਤੇ ਸਵੈ-ਪ੍ਰਕਾਸ਼ਵਾਨ ਹੈ। ਗੁਰਾਂ ਦੀ ਦਯਾ ਦੁਆਰਾ ਉਹ ਪਰਾਪਤ ਹੁੰਦਾ ਹੈ।
ਉਸ ਦਾ ਸਿਮਰਨ ਕਰ।
ਪਰਾਰੰਭ ਵਿੱਚ ਸੱਚਾ, ਯੁਗਾਂ ਦੇ ਸ਼ੁਰੂ ਵਿੱਚ ਸੱਚਾ,
ਅਤੇ ਸੱਚਾ ਉਹ ਹੁਣ ਭੀ ਹੈ, ਹੇ ਨਾਨਕ! ਨਿਸਚਿਤ ਹੀ, ਉਹ ਸੱਚਾ ਹੋਵੇਗਾ।
...
ਕਈ ਇਕ ਗਾਇਨ ਕਰਦੇ ਹਨ ਕਿ ਵਾਹਿਗੁਰੂ ਪ੍ਰਾਣ ਲੈ ਲੈਂਦਾ ਹੈ ਤੇ ਮੁੜ ਵਾਪਸ ਦੇ ਦਿੰਦਾ ਹੈ।
ਕਈ ਗਾਇਨ ਕਰਦੇ ਹਨ ਕਿ ਹਰੀ ਦੁਰੇਡੇ ਮਲੂਮ ਹੁੰਦਾ ਅਤੇ ਸੁੱਝਦਾ ਹੈ।
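If you would rather keep urllib from the question, the same cookie can probably be sent by hand as a Cookie header. This is only a minimal sketch, assuming the site needs nothing more than lang_16=yes. build_request and fetch_html are hypothetical helper names made up for illustration, not part of any library:

```python
from urllib.request import Request, urlopen

URL = "https://www.searchgurbani.com/guru_granth_sahib/ang_by_ang"


def build_request(url, cookies):
    """Hypothetical helper: attach cookies to a urllib request as a Cookie header."""
    cookie_header = "; ".join(
        "{}={}".format(name, value) for name, value in cookies.items()
    )
    return Request(url, headers={"Cookie": cookie_header})


def fetch_html(url, cookies):
    """Hypothetical helper: download the page with the cookies set (needs network access)."""
    with urlopen(build_request(url, cookies)) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network, so the header can be inspected first.
request = build_request(URL, {"lang_16": "yes"})
print(request.get_header("Cookie"))
```

Passing the result of fetch_html(URL, {"lang_16": "yes"}) to BeautifulSoup should then work with the find_all(class_="lang_16") loop from the question. The plain urlopen call there sent no cookie at all, which would explain why the lang_16 elements were missing.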
Upvotes: 2