Saurav Raj Joshi

Reputation: 5

Web scraping using Beautiful Soup is giving inaccurate results

So I am using Beautiful Soup and trying to get the list of companies from the website https://www.weps.org/companies . The function I made simply takes the URL "https://www.weps.org/companies?combine=&field_sector_target_id=All&field_company_type_value=All&field_number_of_employees_value=All&field_region_target_id=All&field_country_target_id=All&page=0" and increments the last digit up to 310 to cover all the pages. Then a simple get_text() is used to extract the data, which is saved to a CSV. I got an almost complete list, but some entries are out of order and some are repeated. I think 95% or more of the data is accurate, but some of it is altered. What could be the reason? This is my code:

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
import pandas as pd

company = []
types = []
requrl = "https://www.weps.org/companies?combine=&field_sector_target_id=All&field_company_type_value=All&field_number_of_employees_value=All&field_region_target_id=All&field_country_target_id=All&page=0"
reqlist = list(requrl)  # the URL as a list of characters, so the last one can be swapped
j = 0
for i in range(0, 310):
    reqlist[-1] = j  # replace the trailing "0" with the current page number
    j = j + 1
    listToStr = ''.join([str(elem) for elem in reqlist])
    page = requests.get(listToStr)
    soup = BeautifulSoup(page.content, 'html.parser')
    company_only = soup.select(".field-content .skiptranslate")
    company = company + [cm.get_text() for cm in company_only]
    types_only = soup.select(".views-field-nothing .field-content")
    types = types + [tp.get_text() for tp in types_only]

data = pd.DataFrame({
    'Name': company,
    'Type | Location | Date': types
})

data.to_csv(r'finalfile.csv', index=False)

Upvotes: 0

Views: 238

Answers (1)

Dan-Dev
Dan-Dev

Reputation: 9430

I tried tidying your code and using requests.session(). Your range is wrong: it only goes to page 309, while the last page with data is 310. I also stripped whitespace to make the output easier to parse.

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
import pandas as pd

session = requests.session()  # reuse one connection for all page requests
company = []
types = []
base_url = "https://www.weps.org/companies?combine=&field_sector_target_id=All&field_company_type_value=All&field_number_of_employees_value=All&field_region_target_id=All&field_country_target_id=All&page="
# The last page with data on it is 310, so use range(0, 311).
for i in range(0, 311):
    page = session.get(f'{base_url}{i}')
    soup = BeautifulSoup(page.content, 'html.parser')
    company_only = soup.select(".field-content .skiptranslate")
    company = company + [cm.get_text().strip() for cm in company_only]
    types_only = soup.select(".views-field-nothing .field-content")
    types = types + [tp.get_text().strip() for tp in types_only]

data = pd.DataFrame({
    'Name': company,
    'Type | Location | Date': types
})

data.to_csv(r'finalfile.csv', index=False)
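
If you would rather not hardcode the last page, you could read it from the pager on the first page instead. This is only a sketch, not part of the original answer: it assumes the site exposes a Drupal-style "last page" link whose href carries a page= query parameter, and the .pager__item--last selector is an assumption:

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

base_url = "https://www.weps.org/companies?combine=&field_sector_target_id=All&field_company_type_value=All&field_number_of_employees_value=All&field_region_target_id=All&field_country_target_id=All&page="

def last_page(session):
    # Assumption: a Drupal-style pager with a "last" link such as ...&page=310.
    soup = BeautifulSoup(session.get(f'{base_url}0').content, 'html.parser')
    link = soup.select_one('.pager__item--last a')
    if link is None:
        return 310  # selector not found; fall back to the known last page
    return int(parse_qs(urlparse(link['href']).query)['page'][0])

session = requests.session()
for i in range(0, last_page(session) + 1):
    pass  # scrape each page exactly as in the answer above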

I then counted the lines in the file:

cat finalfile.csv | wc -l
3104

The website was reporting 3103 companies; add the CSV header row and 3104 lines is correct.

Then I counted the unique lines in the file:

cat finalfile.csv | sort -u | wc -l
3091

Some companies are repeated, so I printed the duplicated lines:

cat finalfile.csv | sort | uniq -d
Banco Amazonas S.A.,Banks  | Americas and the Caribbean | Ecuador | 09 May 2019
Careem,Software & Computer Services  | Arab States | Qatar | 13 May 2018
Careem,Software & Computer Services  | Asia and the Pacific | Pakistan | 13 May 2018
Hong Kong Exchanges and Clearing Limited,"Financial Services  | Asia and the Pacific | China, Hong Kong SAR |"
HİTAY PLAZA,General Retailers  | Europe and Central Asia | Turkey | 06 March 2019
"Kowa Co., Ltd.",Health Care Equipment & Services  | Asia and the Pacific | Japan | 17 September 2010
Madrigal Sports,General Industrials  | Asia and the Pacific | Pakistan | 05 December 2017
Novartis Corporativo S.A. de C.V.,Health Care Providers  | Global | Mexico | 07 February 2020
Poppins Corporation,Support Services  | Asia and the Pacific | Japan | 17 September 2010
Procter & Gamble Japan K.K.,Food & Drug Retailers  | Asia and the Pacific | Japan | 17 September 2010
"Shiseido Co., Ltd.",Personal Goods  | Asia and the Pacific | Japan | 17 September 2010
Tesco PLC,Food & Drug Retailers  | Europe and Central Asia | United Kingdom of Great Britain and Northern Ireland | 06 March 2019
Xiaohongshu,Internet  | Asia and the Pacific | China | 05 March 2020

I ran the script and the bash commands again and got the same result, so I conclude that the 3103 companies listed on the website include duplicates and that none are missing from the results.

Just to check, I searched the site for the keyword "Careem" and got duplicated results.
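
If the duplicates are unwanted in the CSV, pandas can drop them before writing. A minimal sketch, assuming the data DataFrame built in the script above:

print(data.duplicated().sum())  # count of exact duplicate rows
data = data.drop_duplicates()   # keep only the first occurrence of each row
data.to_csv(r'finalfile.csv', index=False)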

Upvotes: 1
