Reputation: 477
I tried to scrape a Japanese website by following a simple online tutorial, but I could not get the information from the website. Below is my code:
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = requests.get(wiki)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'lxml')
for i in soup.findAll('data payments'):
    print(i.text)
The information I want to get is in the part below:
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
I wish to print out the payment label, which is "賃料", together with the price "7.3万円".
Expected output (as a string):
"payment: 賃料 is 7.3万円"
Edited:
import requests
wiki = "https://www.athome.co.jp/"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki, headers=headers)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'lxml')
print(soup.decode('utf-8', 'replace'))
Upvotes: 0
Views: 1478
Reputation: 54
In the latest version of your code you decode the soup, so the result is a plain string and you can no longer use BeautifulSoup methods such as find and find_all on it. But we will come back to that later.
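To illustrate the point (a minimal sketch on a throwaway HTML string, not part of your code): decode() turns the soup into a plain str, and str.find is a substring search, not a tag search.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<dl class="data payments"><dt>賃料:</dt></dl>', 'lxml')
text = soup.decode()           # decode() returns a plain str

print(type(soup))              # <class 'bs4.BeautifulSoup'>
print(type(text))              # <class 'str'>
print(soup.find('dt'))         # <dt>賃料:</dt>  (tag search)
print(text.find('dt'))         # an integer index (substring search); find_all does not exist at all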
After getting the soup, you can print it, and you will see the following (only the key part is shown):
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="0" http-equiv="expires"/>
<meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
<meta content="10; url=/distil_r_captcha.html?requestId=2ac19293-8282-4602-8bf5-126d194a4827&httpReferrer=%2Fchintai%2F1001303243%2F%3FDOWN%3D2%26BKLISTID%3D002LPC%26sref%3Dlist_simple%26bi%3Dtatemono" http-equiv="refresh"/>
This means you are not getting the full page: you have been detected as a crawler. So something is missing from @KunduK's answer, and at this point it has nothing to do with the find function.
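(As an aside, here is a minimal sketch of how you could check programmatically whether you were served this anti-crawler page; the marker strings are taken from the blocked response shown above and are an assumption on my part.)
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.athome.co.jp/")
soup = BeautifulSoup(page.content, 'lxml')

# The blocked page contains a NOINDEX robots meta tag and a refresh redirect
# to a distil captcha page, as shown in the output above.
robots = soup.find('meta', attrs={'name': 'ROBOTS'})
refresh = soup.find('meta', attrs={'http-equiv': 'refresh'})
blocked = (robots is not None and 'NOINDEX' in robots.get('content', '')) \
    or (refresh is not None and 'captcha' in refresh.get('content', ''))
print('Blocked by the anti-crawler page:', blocked)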
First of all, you need to make your Python script look less like a crawler.
Headers are the most common thing used to detect a crawler. With plain requests, when you create a session, you can check its default headers with:
>>> s = requests.session()
>>> print(s.headers)
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
You can see that these headers tell the server you are a crawler program, namely python-requests/2.22.0. Therefore, you need to modify the User-Agent by updating the headers:
s = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
However, when testing this, the script is still detected as a crawler, so we need to dig further into the headers. (It could also be for other reasons, such as IP blocking or cookies; I will mention those later.)
In Chrome, open the Developer Tools and then open the website. (To simulate a first visit to the site, you should clear the cookies first.) After clearing the cookies, refresh the page. In the Network tab of the Developer Tools, you will see a lot of requests sent by Chrome.
Click the first entry, which is https://www.athome.co.jp/, and a detailed panel appears on the right side; its Request Headers section shows the headers Chrome generated and sent to the target site's server.
To make sure everything works, you could simply copy everything from these Chrome headers into your crawler, and the server can no longer tell whether you are the real Chrome or a crawler. (This works for most sites, although I have also found sites with strange settings that require a special header in every request.)
I have already dug out that, for this site, adding accept-language is enough for the anti-crawler check to let you pass.
Putting it all together, you need to update your headers like this:
headers = {
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
For an explanation of cookies, you can refer to the wiki. There is an easy way to obtain the cookie. First, initiate a session and update the headers, as mentioned above. Second, request the page https://www.athome.co.jp; once you get the page, you will have a cookie issued by the server.
s.get(url='https://www.athome.co.jp')
The advantage of requests.session is that the session maintains the cookies for you, so your next request will send this cookie automatically.
You can check the cookie you obtained with:
print(s.cookies)
And my result is:
<RequestsCookieJar[Cookie(version=0, name='athome_lab', value='ffba98ff.592d4d027d28b', port=None, port_specified=False, domain='www.athome.co.jp', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1884177606, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
You do not need to parse this page, because you just want the cookie rather than the content.
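If you want to confirm that the cookie really is attached automatically, a quick check (a minimal sketch, not required for the crawl itself) is to look at the headers of the next prepared request:
# The session merges its cookies into the outgoing request, so the Cookie
# header should now contain the athome_lab value shown above.
next_page = s.get('https://www.athome.co.jp')
print(next_page.request.headers.get('Cookie'))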
You can now use the same session to request the page you mentioned:
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)
Now the server will send you everything you want, and you can parse it with BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
After getting the content you want, you can use BeautifulSoup to get the target element.
soup.find('dl', attrs={'class': 'data payments'})
And what you will get is:
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
And you can extract the information you want from it:
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()
To format it as a single line:
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))
That is everything. The full code is pasted below.
# Import packages you want.
import requests
from bs4 import BeautifulSoup
# Initiate a session and update the headers.
s = requests.session()
headers = {
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
# Get the homepage of the website and get cookies.
s.get(url='https://www.athome.co.jp')
"""
# You might need to use the following part to check if you have successfully obtained the cookies.
# If not, you might have been blocked by the anti-crawler.
print(s.cookies)
"""
# Get the content from the page.
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)
# Parse the webpage for getting the elements.
soup = BeautifulSoup(page.content, 'html.parser')
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()
# Print the result.
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))
In the crawling field, there is still a long way to go.
You should learn the outline of it and make full use of the browser's Developer Tools.
You may need to find out whether the content is loaded by JavaScript, or whether it sits inside an iframe (see the sketch below).
What's more, you might be detected as a crawler and blocked by the server; anti-anti-crawler skills only come from writing more crawlers.
I suggest you start with an easier website that has no anti-crawler protection.
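For example, two quick checks you can run on any parsed page (a minimal sketch, just an illustration of what to look for):
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.athome.co.jp")
soup = BeautifulSoup(page.content, 'html.parser')

# Content inside an iframe lives in a separate document that you have to
# request separately via its src URL.
for frame in soup.find_all('iframe'):
    print('iframe source:', frame.get('src'))

# Many <script> tags with little static markup usually means the content is
# rendered by JavaScript, which requests + BeautifulSoup cannot execute.
print('number of <script> tags:', len(soup.find_all('script')))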
Upvotes: 4
Reputation: 33384
Try the code below. Use the class name together with the tag to find the element.
from bs4 import BeautifulSoup
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
for i in soup.find_all("dl", class_="data payments"):
    print(i.find('dt').text)
    print(i.find('span').text)
Output:
賃料:
7.3万円
If you want to produce exactly your expected output, try this:
from bs4 import BeautifulSoup
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
for i in soup.find_all("dl", class_="data payments"):
    print("Payment: " + i.find('dt').text.split(':')[0] + " is " + i.find('span').text)
Output:
Payment: 賃料 is 7.3万円
Upvotes: 1
Reputation: 8189
The problem you are having is that the site blocks your requests because it identifies them as coming from a bot.
The usual trick to get around that is to attach the same headers (including the cookies) your browser sends with the request. You can see all the headers Chrome is sending if you go to Inspect > Network > Request > Copy > Copy as cURL.
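For example, a minimal sketch of replaying those browser headers and cookies with requests (the header and cookie values here are placeholders; copy the real ones from your own browser):
import requests
from bs4 import BeautifulSoup

# Placeholder values: replace them with the headers and cookies taken from
# the "Copy as cURL" output of your own browser session.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
cookies = {'athome_lab': '<value copied from your browser>'}

url = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find('dl', attrs={'class': 'data payments'}))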
When you run your script, you get the following:
You reached this page when attempting to access https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono from 152.172.223.133 on 2019-09-18 02:21:34 UTC.
Upvotes: 0