Pandas returning Empty Data Frame

Question

I am trying to scrape a JavaScript-heavy website and extract the contents of one specific column. The page needs to load and then navigate to a new page, and from that page I would like to extract the sport info.

I am using pandas, BeautifulSoup, and Selenium.

Navigating to the next page works fine, and so do the loading wait times. Below is the BeautifulSoup code:

soup = BeautifulSoup(results.get_attribute("outerHTML"), 'html.parser')
time = []  # Time
sport = []  # Sport Name
description = []  # Sport Description

Below is the code that searches for the XPath of the specific parts of the page.

# Programme time
for item in soup.select("guide___1Ogg9"):
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[2]/a/div[1]'):
        time.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[2]/a/div[1]').text.strip())
    else:
        time.append("Nan")

# Sport Name
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[1]/a/div/div[2]/div[1]/span'):
        sport.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[1]/a/div/div[2]/div[1]/span').text.strip())
    else:
        sport.append("Nan")

# Programme info
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[4]/div[2]/div/div/ul/div[2]/ul/li/ul/li[4]/a/div[2]'):
        description.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[4]/div[2]/div/div/ul/div[2]/ul/li/ul/li[4]/a/div[2]').text.strip())
    else:
        description.append("Nan")

Below is the code that writes all the data to the CSV file.

df = pd.DataFrame(
    {"Time": time, "Sport": sport, "Info": description})
print("Here is your data. Right I am off to sleep then!")

print(df)
df.to_csv("canalPlusSport.csv")

I have also tried searching by CSS_SELECTOR and CLASS_NAME.

The website is https://www.canalplus.com/programme-tv/

baduker · Accepted Answer

You're right that the site is JavaScript-heavy, but that often means there's an API on the backend. And in this case, there is one.

You can use it to fetch the data you want.

Here's how:

import datetime

import pendulum
import requests
from tabulate import tabulate

api_url = "https://secure-webtv-static.canal-plus.com/metadata/cpfra/all/v2.2/globalchannels.json"
response = requests.get(api_url).json()

tv_programme = {
    channel["name"]: [
        [
            e['title'],
            e['subTitle'],
            pendulum.parse(e['timecodes'][0]['start']).time().strftime("%H:%M"),
            # Duration is given in milliseconds; drop the fractional seconds.
            str(datetime.timedelta(
                milliseconds=e['timecodes'][0]['duration'],
            )).split(".")[0],
        ] for e in channel["events"]
    ] for channel in response["channels"]
}


print(tabulate(
    tv_programme["CANAL+"],
    headers=["Title", "Subtitle", "Time", "Duration"],
    tablefmt="simple",
))

This outputs (for CANAL+, but you can try any channel):

Title                                                                     Subtitle                         Time    Duration
------------------------------------------------------------------------  -------------------------------  ------  ----------
Canal Football Club - Samedi - 1re édition                                Mag Foot                         19:30   0:23:00
Avant-match Ligue 1                                                       Mag Foot                         19:58   0:04:36
Nice / Lyon                                                               16e journée                      20:02   0:50:00
Canal Football Club - Samedi - 2ème édition                               Mag Foot                         21:59   0:55:00
Zapsport                                                                  Mag Sport                        22:56   0:03:41
Le Plus                                                                   Le Show de Noël Must Go on Date  23:00   0:01:59
Le journal du hard                                                        Mag Adultes                      23:02   0:01:07
Une nuit à Budapest                                                       Film Adultes                     23:03   1:32:14
Furie                                                                     Film Suspense                    00:35   1:33:49
Zombi Child                                                               Film Emotion                     02:10   1:38:39
Veuillez parler sans arrêt et décrire vos expériences au fur et à mesure  Court-Metrage                    03:49   0:09:04
Le grand rendez-vous                                                      Court-Metrage                    03:58   0:05:39
Golf - US Open féminin                                                    3e tour                          04:05   1:08:26
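
Since the question was about getting the data into a DataFrame and a CSV file, here is a minimal sketch of flattening the same response into one row per event. It assumes the field names used in the code above (`channels`, `name`, `events`, `title`, `subTitle`, `timecodes`, `start`, `duration`) and that `start` is a timestamp string. The helpers `flatten_programme` and `rows_to_csv` are hypothetical names, and the sample payload is made up for illustration, not real API output.

```python
import csv
import datetime
import io


def flatten_programme(payload):
    """Turn the API payload into one flat row (a dict) per event."""
    rows = []
    for channel in payload.get("channels", []):
        for event in channel.get("events", []):
            # Assumption: the first timecode entry describes the broadcast slot.
            first = (event.get("timecodes") or [{}])[0]
            duration_ms = first.get("duration")
            rows.append({
                "Channel": channel.get("name"),
                "Title": event.get("title"),
                "Subtitle": event.get("subTitle"),
                "Start": first.get("start"),
                "Duration": (
                    str(datetime.timedelta(milliseconds=duration_ms)).split(".")[0]
                    if duration_ms is not None else None
                ),
            })
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text; column order follows the first row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample in the same shape as the API response (not real data).
sample = {
    "channels": [
        {
            "name": "EXAMPLE CHANNEL",
            "events": [
                {
                    "title": "Example match",
                    "subTitle": "Example round",
                    "timecodes": [
                        {"start": "2020-12-19T20:02:00Z", "duration": 3000000},
                    ],
                },
            ],
        },
    ],
}

rows = flatten_programme(sample)
print(rows_to_csv(rows))
```

With the real response you would call `flatten_programme(response)` instead of using the sample. A list of dicts like `rows` can also be passed straight to `pd.DataFrame(rows)`, after which `df.to_csv(...)` works as in the question.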

EDIT:

To list all the channels, one per line, just add this:

print("\n".join(sorted(tv_programme.keys())))

This will get you this:

6TER
AB1
ACTION
ALTICE STUDIO
ANIMAUX
ARTE
ASTROCENTER TV
AUTOMOTO LA CHAINE
BBC WORLD NEWS
BEIN SPORTS 1
BEIN SPORTS 2
BEIN SPORTS 3
BEIN SPORTS MAX 10
BEIN SPORTS MAX 4
BEIN SPORTS MAX 5
BEIN SPORTS MAX 6
BEIN SPORTS MAX 7
BEIN SPORTS MAX 8
BEIN SPORTS MAX 9
BET
BFM BUSINESS
BFM TV
BOB TV
BOING
BOOMERANG
BSMART TV
C8
C8 (CH)
CANAL 9
CANAL ALPHA NE
CANAL J
CANAL+
CANAL+ (CH)
CANAL+ CINEMA
CANAL+ CINEMA (CH)
CANAL+ DECALE
CANAL+ DECALE (CH)
CANAL+ FAMILY
CANAL+ FAMILY (CH)
CANAL+ FORMULA1
CANAL+ LIGUE1
CANAL+ MOTOGP
CANAL+ PREMIER LEAGUE
CANAL+ SERIES
CANAL+ SPORT
CANAL+ SPORT (CH)
CANAL+ TOP14
CANAL+ UHD
...
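
If you only care about sport, one rough way to narrow things down is to filter the channels by name. The sketch below assumes `tv_programme` is the dict built earlier (channel name mapped to a list of events). `sport_channels` is a hypothetical helper, and the sample dict only borrows a few names from the list above with the events left empty.

```python
def sport_channels(programme, keyword="SPORT"):
    """Keep only the channels whose name contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return {
        name: events
        for name, events in programme.items()
        if needle in name.casefold()
    }


# Stand-in for tv_programme: names from the channel list, events left empty.
tv_programme_sample = {
    "ARTE": [],
    "BEIN SPORTS 1": [],
    "CANAL+ CINEMA": [],
    "CANAL+ SPORT": [],
}

selected = sport_channels(tv_programme_sample)
print(sorted(selected))  # ['BEIN SPORTS 1', 'CANAL+ SPORT']
```

This is only a heuristic: channels such as CANAL+ LIGUE1 or CANAL+ TOP14 do not have "SPORT" in their name, and general channels like CANAL+ also air sport, so you may prefer to filter on the event subtitle instead.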
