Reputation: 13
I am trying to scrape data from this website, http://rgphentableaux.hcp.ma/Default1/, by selecting two radio buttons and then choosing an entry from a drop-down list.
I need to do this for every choice available in that list and add the resulting tables to a dataframe. Here is what I have tried so far, but it didn't work:
# pip install selenium  (run this in a shell, not inside the script)
import time
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome()
url = "http://rgphentableaux.hcp.ma/Default1/"
browser.get(url)  # navigate to the page
browser.find_element_by_xpath(".//input[@type='radio' and @value='5']").click()
browser.find_element_by_id("CGEO").click()
time.sleep(3)
browser.find_element_by_xpath(".//input[@type='button' and @value='Afficher']").click()
tabs = browser.find_elements_by_id('IEE')
innerHTML = browser.execute_script("return document.body.innerHTML")
soup_level2 = BeautifulSoup(innerHTML, 'html.parser')
PS: I also need to get the tables that are here.
Upvotes: 1
Views: 109
Reputation: 84465
You could do the whole thing with requests and bs4 by mimicking the requests the page makes. You just need to loop over the regions, in the right order, and pass the current region's number as the 'CGEO' param in each request.
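As a rough illustration of what "mimicking the requests" means, the sketch below builds the URL that such a request would hit, using only the standard library. The getDATA endpoint and the type, CGEO and them params are taken from the code in this answer; the helper name build_url is hypothetical and only there for illustration.

```python
from urllib.parse import urlencode

# Endpoint used by the page for its data requests (as used in this answer).
BASE_URL = "http://rgphentableaux.hcp.ma/Default1/getDATA/"

def build_url(region_code, theme="5"):
    """Hypothetical helper: return the full query URL for one region."""
    params = (("type", "Region"), ("CGEO", region_code), ("them", theme))
    return f"{BASE_URL}?{urlencode(params)}"

# '01' is the option value given in this answer for Tanger-Tetouan-Al Hoceima.
print(build_url("01"))
```

Opening the printed URL in a browser is a quick way to check what the endpoint returns before writing the full loop.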
This:
soup = bs(s.get(url).content, 'lxml')
regions = {i.text.strip():i['value'].strip() for i in soup.select('#REGIONSLIST option')}
gathers the region names and their option values from the landing url, as a dict mapping each name to its value.
This:
for k,v in regions.items():
    params = (('type', 'Region'), ('CGEO', v), ('them', '5'))
sets the CGEO param to the value attribute of the region's option tag, e.g. Tanger-Tetouan-Al Hoceima is '01'. The Region option is set within the type param. The Langues locales utilisées option is set within the them param, i.e. '5'.
This:
for y in range(3):
    row.extend([data[i-y+2]['DATA2014']])
just reverses the order of the items, so that the Ens, Fem, Masc values in each group of three dictionaries within data get added to the row in the desired output order of Masc, Fem, Ens.
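To make the index arithmetic concrete, here is a small offline sketch of the same reordering. The records are made up for illustration (the INDICATEUR strings and the letters standing in for values are not real data); only the INDICATEUR and DATA2014 keys and the Ens, Fem, Masc ordering come from this answer. The function name build_rows is hypothetical.

```python
def build_rows(region, data):
    """Group records in threes and emit [region, lang, Masc, Fem, Ens]."""
    rows = []
    for i in range(0, len(data), 3):
        row = [region, data[i]["INDICATEUR"].split("_")[-1]]
        for y in range(3):
            # y = 0, 1, 2 picks index i+2, i+1, i: the group read backwards.
            row.append(data[i - y + 2]["DATA2014"])
        rows.append(row)
    return rows

# Made-up sample: two languages, each with Ens, Fem, Masc records in that order.
sample = [
    {"INDICATEUR": "demo_LangA", "DATA2014": "E1"},
    {"INDICATEUR": "demo_LangA", "DATA2014": "F1"},
    {"INDICATEUR": "demo_LangA", "DATA2014": "M1"},
    {"INDICATEUR": "demo_LangB", "DATA2014": "E2"},
    {"INDICATEUR": "demo_LangB", "DATA2014": "F2"},
    {"INDICATEUR": "demo_LangB", "DATA2014": "M2"},
]

rows = build_rows("Region X", sample)
print(rows)
```

The inner loop is equivalent to reversing the slice data[i:i+3], which some readers may find easier to follow.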
Py:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

def add_rows(region, data):
    # Only the first third of data (the first table) is used here;
    # each language has 3 consecutive records: Ens, Fem, Masc.
    for i in range(0, len(data)//3, 3):
        row = [region, data[i]['INDICATEUR'].split('_')[-1]]
        for y in range(3):
            # Read the group backwards to get Masc, Fem, Ens.
            row.extend([data[i-y+2]['DATA2014']])
        final.append(row)

url = 'http://rgphentableaux.hcp.ma/Default1'
headers= {'User-Agent': 'Mozilla/5.0', 'Referer': url}
final = []

with requests.Session() as s:
    s.headers = headers
    soup = bs(s.get(url).content, 'lxml')
    # Map each region name to its option value.
    regions = {i.text.strip():i['value'].strip() for i in soup.select('#REGIONSLIST option')}
    for k,v in regions.items():
        params = (('type', 'Region'), ('CGEO', v), ('them', '5'))
        r = s.get(f'{url}/getDATA/', params=params)
        data = r.json()
        add_rows(k, data)

df = pd.DataFrame(final, columns = ['Region', 'Lang', 'Masc', 'Fem', 'Ens'])
print(df)
EDIT:
To get all 3 tables (ensemble, urbain, rural), adjust the custom function as shown below and add in the additional loop for n in range(0, len(data), block):
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

def add_rows(table, region, data_block):
    # data_block holds one table; each language has 3 records: Ens, Fem, Masc.
    for i in range(0, len(data_block), 3):
        row = [table, region, data_block[i]['INDICATEUR'].split('_')[-1]]
        for y in range(3):
            # Read the group backwards to get Masc, Fem, Ens.
            row.extend([data_block[i-y+2]['DATA2014']])
        final.append(row)

url = 'http://rgphentableaux.hcp.ma/Default1'
headers= {'User-Agent': 'Mozilla/5.0', 'Referer': url}
tables = ['ens', 'urb', 'rur']
final = []

with requests.Session() as s:
    s.headers = headers
    soup = bs(s.get(url).content, 'lxml')
    regions = {i.text.strip():i['value'].strip() for i in soup.select('#REGIONSLIST option')}
    for k,v in regions.items():
        params = (('type', 'Region'), ('CGEO', v), ('them', '5'))
        r = s.get(f'{url}/getDATA/', params=params)
        data = r.json()
        # The response is treated as 3 equal consecutive blocks, one per table.
        block = len(data)//3
        for n in range(0, len(data), block):
            table = tables[n//block]
            add_rows(table, k, data[n:n+block])

df = pd.DataFrame(final, columns = ['Table', 'Region', 'Language', 'Masc', 'Fem', 'Ens'])
print(df)
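The block splitting can be tried offline on a plain list. The sketch below assumes, as the code above does, that the response holds equal-sized consecutive blocks in the order ens, urb, rur. The helper name split_blocks and the length check are additions for illustration, not something the site or the original code provides.

```python
def split_blocks(data, labels):
    """Hypothetical helper: split data into len(labels) equal consecutive blocks."""
    if not data or len(data) % len(labels) != 0:
        # Guard added for the sketch: an uneven length would otherwise
        # produce a trailing partial block with no matching label.
        raise ValueError("data cannot be split evenly between the labels")
    block = len(data) // len(labels)
    return {
        labels[n // block]: data[n:n + block]
        for n in range(0, len(data), block)
    }

# Stand-in data: six items instead of real JSON records.
blocks = split_blocks(list(range(6)), ["ens", "urb", "rur"])
print(blocks)
```

Each value in the returned dict could then be passed to add_rows together with its label.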
Upvotes: 1
Reputation: 193098
To select the Langues locales utilisées and Region options, pick a region from the list, and scrape the resulting table, you can use the following solution:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://rgphentableaux.hcp.ma/Default1/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@text='Langues locales utilisées']"))).click()
driver.find_element(By.XPATH, "//input[@value='Region']").click()
# Scroll the region chooser into view before clicking it.
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@value='Choisir une entitée']"))))
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@value='Choisir une entitée']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//li[contains(., 'Tanger-Tetouan-Al Hoceima')]"))).click()
driver.find_element(By.XPATH, "//input[@value='Afficher']").click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='tableau']/tbody"))).text)
Console Output:
Population municipale 16 747 522 16 862 562 33 610 084
Répartition selon les grands groupes d'âges
Moins de 6 ans 12.4 11.8 12.1
De 6 à 14 ans 16.5 15.7 16.1
De 15 à 59 ans 61.8 63.0 62.4
60 ans et plus 9.3 9.5 9.4
Répartition selon le groupe d'âges quinquennal
0-4 ans 10.4 9.9 10.2
5-9 ans 9.2 8.8 9.0
10-14 ans 9.3 8.8 9.0
15-19 ans 8.9 8.8 8.9
20-24 ans 9.0 9.1 9.1
25-29 ans 8.2 8.4 8.3
30-34 ans 7.7 8.0 7.8
35-39 ans 6.8 7.2 7.0
40-44 ans 6.3 6.5 6.4
45-49 ans 5.3 5.6 5.4
50-54 ans 5.3 5.4 5.3
55-59 ans 4.2 4.0 4.1
60-64 ans 3.4 3.3 3.4
65-69 ans 1.9 1.9 1.9
70-74 ans 1.6 1.8 1.7
75 ans et plus 2.4 2.6 2.5
État matrimonial
Célibataire 57.9 48.4 53.2
Marié 40.8 42.0 41.4
Divorcé 0.7 2.4 1.6
Veuf 0.6 7.1 3.9
Âge moyen au premier mariage 31.3 25.7 28.5
Fécondité
Parité moyenne à 45-49 ans / 3.5 /
Indice synthétique de fécondité / 2.2 /
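Since the goal is a dataframe rather than printed text, one possible next step is to split that text into rows. This is only a sketch: it keeps lines that end in exactly three values, each either a decimal number or the "/" placeholder, and skips everything else. That means headings are dropped, and so is the first line of the output, whose numbers are grouped with spaces. The function name parse_rows is hypothetical.

```python
import re

# A label followed by three values; a value is a decimal such as 12.4 or "/".
VALUE = r"(\d+\.\d+|/)"
ROW = re.compile(rf"^(.+?)\s+{VALUE}\s+{VALUE}\s+{VALUE}$")

def parse_rows(text):
    """Return [label, v1, v2, v3] for every line that ends in three values."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(list(match.groups()))
    return rows

# A few lines copied from the console output above.
sample = """Répartition selon les grands groupes d'âges
Moins de 6 ans 12.4 11.8 12.1
60 ans et plus 9.3 9.5 9.4
Parité moyenne à 45-49 ans / 3.5 /"""

rows = parse_rows(sample)
print(rows)
```

The resulting list of rows can be handed to pandas.DataFrame with whatever column names fit the table being scraped.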
Upvotes: 0