Reputation: 479
I am trying to scrape a JavaScript-heavy website and extract the contents of specific columns. The page needs to load and then navigate to a new page, and from that page I would like to extract the sport info.
I am using Pandas, BeautifulSoup and Selenium.
Navigating to the next page and waiting for it to load both work fine. Below is the BeautifulSoup code:
soup = BeautifulSoup(results.get_attribute("outerHTML"), 'html.parser')
time = [] # Time
sport = [] # Sport Name
description = [] # Sport Description
Below is the code that searches for the XPath of the specific parts of the page.
for item in soup.select("guide___1Ogg9"):
    # Programme time
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[2]/a/div[1]'):
        time.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[2]/a/div[1]').text.strip())
    else:
        time.append("Nan")
    # Sport Name
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[1]/a/div/div[2]/div[1]/span'):
        sport.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[1]/a/div/div[2]/div[1]/span').text.strip())
    else:
        sport.append("Nan")
    # Programme info
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[4]/div[2]/div/div/ul/div[2]/ul/li/ul/li[4]/a/div[2]'):
        description.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[4]/div[2]/div/div/ul/div[2]/ul/li/ul/li[4]/a/div[2]').text.strip())
    else:
        description.append("Nan")
Below is the code that writes all the data to the CSV file.
df = pd.DataFrame(
    {"Time": time, "Sport": sport, "Info": description})
print("Here is your data. Right I am off to sleep then!")
print(df)
df.to_csv("canalPlusSport.csv")
I have also tried searching by CSS_SELECTOR and CLASS_NAME.
The website is https://www.canalplus.com/programme-tv/
Upvotes: 1
Views: 615
Reputation: 20050
You're right that the site is JavaScript-heavy, but that often means there's an API on the backend. And, actually, in this case there is one.
You can use it to fetch the data you want.
Here's how:
import datetime

import pendulum
import requests
from tabulate import tabulate

api_url = "https://secure-webtv-static.canal-plus.com/metadata/cpfra/all/v2.2/globalchannels.json"
response = requests.get(api_url).json()

# Map each channel name to a list of [title, subtitle, start time, duration] rows.
tv_programme = {
    channel["name"]: [
        [
            e['title'],
            e['subTitle'],
            # Start of the first timecode, shown as HH:MM.
            pendulum.parse(e['timecodes'][0]['start']).time().strftime("%H:%M"),
            # Duration is given in milliseconds; drop any fractional seconds.
            str(datetime.timedelta(
                milliseconds=e['timecodes'][0]['duration'],
            )).split(".")[0],
        ] for e in channel["events"]
    ] for channel in response["channels"]
}

print(tabulate(
    tv_programme["CANAL+"],
    headers=["Title", "Subtitle", "Time", "Duration"],
    tablefmt="sql",
))
This outputs (for CANAL+, but you can try any channel):
Title                                                                     Subtitle                         Time    Duration
------------------------------------------------------------------------  -------------------------------  ------  ----------
Canal Football Club - Samedi - 1re édition                                Mag Foot                         19:30   0:23:00
Avant-match Ligue 1                                                       Mag Foot                         19:58   0:04:36
Nice / Lyon                                                               16e journée                      20:02   0:50:00
Canal Football Club - Samedi - 2ème édition                               Mag Foot                         21:59   0:55:00
Zapsport                                                                  Mag Sport                        22:56   0:03:41
Le Plus                                                                   Le Show de Noël Must Go on Date  23:00   0:01:59
Le journal du hard                                                        Mag Adultes                      23:02   0:01:07
Une nuit à Budapest                                                       Film Adultes                     23:03   1:32:14
Furie                                                                     Film Suspense                    00:35   1:33:49
Zombi Child                                                               Film Emotion                     02:10   1:38:39
Veuillez parler sans arrêt et décrire vos expériences au fur et à mesure  Court-Metrage                    03:49   0:09:04
Le grand rendez-vous                                                      Court-Metrage                    03:58   0:05:39
Golf - US Open féminin                                                    3e tour                          04:05   1:08:26
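Since the question ends with a CSV file, here is a minimal sketch of how the same fields could be flattened into one row per event and written out as CSV. It runs on a small hand-made sample rather than the live API: `sample_response`, `flatten` and `to_csv_text` are hypothetical names, and the ISO 8601 shape of `start` is an assumption (the real payload may differ, which is why the answer above uses pendulum to parse it).

```python
import csv
import io
from datetime import datetime, timedelta

# Hypothetical sample that mimics only the fields read in the answer above.
# The ISO 8601 "start" string is an assumption about the real payload.
sample_response = {
    "channels": [
        {
            "name": "CANAL+ SPORT",
            "events": [
                {
                    "title": "Golf - US Open féminin",
                    "subTitle": "3e tour",
                    "timecodes": [
                        {"start": "2020-12-13T04:05:00+01:00", "duration": 4106000}
                    ],
                },
            ],
        },
    ],
}

FIELDS = ["Channel", "Title", "Subtitle", "Time", "Duration"]


def flatten(response):
    """Return one dict per event, tagged with the channel it belongs to."""
    rows = []
    for channel in response.get("channels", []):
        for event in channel.get("events", []):
            # Use the first timecode, or an empty dict if none is present.
            first = (event.get("timecodes") or [{}])[0]
            start = first.get("start")
            duration_ms = first.get("duration")
            rows.append({
                "Channel": channel.get("name", ""),
                "Title": event.get("title", ""),
                "Subtitle": event.get("subTitle", ""),
                "Time": datetime.fromisoformat(start).strftime("%H:%M") if start else "",
                # Milliseconds to H:MM:SS, dropping any fractional seconds.
                "Duration": (
                    str(timedelta(milliseconds=duration_ms)).split(".")[0]
                    if duration_ms is not None else ""
                ),
            })
    return rows


def to_csv_text(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten(sample_response)
csv_text = to_csv_text(rows)
print(csv_text)
```

The same list of dicts can also be passed straight to `pd.DataFrame(rows)` and saved with `to_csv`, as in the question.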
EDIT:
To list all the channels, just add this:
print("\n".join(sorted(tv_programme)))
This prints:
6TER
AB1
ACTION
ALTICE STUDIO
ANIMAUX
ARTE
ASTROCENTER TV
AUTOMOTO LA CHAINE
BBC WORLD NEWS
BEIN SPORTS 1
BEIN SPORTS 2
BEIN SPORTS 3
BEIN SPORTS MAX 10
BEIN SPORTS MAX 4
BEIN SPORTS MAX 5
BEIN SPORTS MAX 6
BEIN SPORTS MAX 7
BEIN SPORTS MAX 8
BEIN SPORTS MAX 9
BET
BFM BUSINESS
BFM TV
BOB TV
BOING
BOOMERANG
BSMART TV
C8
C8 (CH)
CANAL 9
CANAL ALPHA NE
CANAL J
CANAL+
CANAL+ (CH)
CANAL+ CINEMA
CANAL+ CINEMA (CH)
CANAL+ DECALE
CANAL+ DECALE (CH)
CANAL+ FAMILY
CANAL+ FAMILY (CH)
CANAL+ FORMULA1
CANAL+ LIGUE1
CANAL+ MOTOGP
CANAL+ PREMIER LEAGUE
CANAL+ SERIES
CANAL+ SPORT
CANAL+ SPORT (CH)
CANAL+ TOP14
CANAL+ UHD
...
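As the question is about sport info specifically, one possible way to narrow the channel list down is to filter the dictionary keys by name. This is only a sketch: `sport_channels` is a hypothetical helper, and the small `tv_programme` dict below stands in for the one built from the API. Matching on the substring "SPORT" is an assumption and would miss sport channels named differently, such as CANAL+ FORMULA1 or CANAL+ LIGUE1.

```python
# Hypothetical stand-in for the tv_programme dict built from the API response.
tv_programme = {
    "ARTE": [["Doc", "Nature", "20:00", "0:50:00"]],
    "BEIN SPORTS 1": [["Match", "Foot", "21:00", "1:45:00"]],
    "CANAL+ SPORT": [["Golf - US Open féminin", "3e tour", "04:05", "1:08:26"]],
}


def sport_channels(programme, keyword="SPORT"):
    """Keep only the channels whose name contains the keyword (case-insensitive)."""
    keyword = keyword.upper()
    return {
        name: events
        for name, events in programme.items()
        if keyword in name.upper()
    }


sports = sport_channels(tv_programme)
for name in sorted(sports):
    print(name, len(sports[name]), "event(s)")
```

Passing a different `keyword`, or a list of exact channel names instead, would be an equally reasonable design depending on which channels matter.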
Upvotes: 1