Reputation: 3852
The Chinese website here mainly presents information about a single company. Since many pages have similar content, I decided to learn how to write a web crawler in Python.
import requests
from bs4 import BeautifulSoup
page = requests.get('http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356')
soup = BeautifulSoup(page.text, 'html.parser')
source_content = soup.find(class_='rightSide').find(class_='content register').find(class_='formestyle')
The figure was captured from Chrome's Inspect Element panel.
Since the Chinese text may be hard to follow, I created an example for illustration:
<th> the variable name </th> => For example, "company name", "company location"
<td> the target data I want to save </td>
With my code above, source_content contains none of that information. The output file looks like this:
Comparing figures 1 and 2, we can see that the longitude and latitude information is missing.
How can I get this data with Python? Any advice would be appreciated.
Upvotes: 2
Views: 72
Reputation: 46759
The information can be obtained if you provide a Referer header in your request, as follows:
import requests
from bs4 import BeautifulSoup
url = 'http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356'
page = requests.get(url, headers={'Referer' : url})
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find(class_='formestyle')
for tr in table.find_all('tr'):
    row = [v.text for v in tr.find_all(['th', 'td'])]
    print(row)
This would display the following type of data:
['地理坐标:', '经度:104.2153 \xa0\xa0纬度:31.3631']
As you can see, the longitude (经度) and latitude (纬度) are now present in the response.
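If you want the coordinates as numbers rather than one string, something like the following might work. It is only a sketch: parse_coordinates is a name I made up, and it assumes every page formats the cell the same way as the sample row above (the labels 经度 and 纬度 followed by a colon and a decimal number), which I have not checked on other pages.

```python
import re

# Sample row copied from the output above; '\xa0' is a non-breaking space.
row = ['地理坐标:', '经度:104.2153 \xa0\xa0纬度:31.3631']


def parse_coordinates(text):
    """Return (longitude, latitude) as floats, or None if either is missing.

    Hypothetical helper: assumes the labels 经度 (longitude) and 纬度
    (latitude) are each followed by an ASCII or full-width colon and a number.
    """
    lon = re.search(r'经度[::]\s*(-?\d+(?:\.\d+)?)', text)
    lat = re.search(r'纬度[::]\s*(-?\d+(?:\.\d+)?)', text)
    if lon is None or lat is None:
        return None
    return float(lon.group(1)), float(lat.group(1))


coords = parse_coordinates(row[1])
print(coords)  # (104.2153, 31.3631)
```

Returning None when a label is missing lets you skip pages that have no coordinates instead of raising an error in the middle of a crawl.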
Upvotes: 1