ZacharyRW

Reputation: 49

Scraping Data with Beautiful Soup

I'm trying to scrape the movie names from my Vudu movie list into a CSV file. I'm at the early stages, and I can't figure out how to use BeautifulSoup to get the name. I know where it's located in the HTML on the website. I have it set to print the location for now, but all it returns is "None".

I have included my code so far and a screenshot of the HTML from the website that I need. Thanks to anyone who helps!

##Make sure to replace USERNAME and PASSWORD with your own username and password

#Import libraries
from bs4 import BeautifulSoup
from lxml import html
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import csv
import json
import re
import requests
import time
import urllib.request

#Login Information
USERNAME = "example"
PASSWORD = "example"

#URLs
login_url = "https://my.vudu.com/MyLogin.html?type=sign_in&url=https%3A%2F%2Fwww.vudu.com%2F"
url = "https://www.vudu.com/movies/#my_vudu/my_movies"

def main():
    session_requests = requests.session()

    chromedriver = 'C:\\chromedriver.exe'
    browser = webdriver.Chrome(chromedriver)
    browser.get('https://my.vudu.com/MyLogin.html?type=sign_in&url=https%3A%2F%2Fwww.vudu.com%2F')

    time.sleep(10)

    username = browser.find_element_by_name('email')
    password = browser.find_element_by_name('password')

    username.send_keys(USERNAME)
    password.send_keys(PASSWORD)

    browser.find_element_by_css_selector('.custom-button').click()

    html = urllib.request.urlopen(url)

    soup = BeautifulSoup(html, 'html.parser')

    name_box = soup.find('div', attrs={'class': 'gwt-Label title'})

    print (name_box)

if __name__ == '__main__':
    main()

[Screenshot: HTML of the Vudu page showing the element that holds the movie name]

Upvotes: 0

Views: 423

Answers (1)

furas

Reputation: 143098

urllib.request.urlopen(url) (and requests.get(url)) gets the HTML directly from the server, so it doesn't contain elements that JavaScript adds in the web browser. It is also a separate request that doesn't share the Selenium browser's session, so it is not logged in.

But you use Selenium, which loads the page and runs JavaScript, so you can get the HTML with all changes from browser.page_source and use it in

soup = BeautifulSoup(browser.page_source, 'html.parser')
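To illustrate the difference, here is a minimal sketch that runs the same lookup on two made-up HTML strings: one standing in for the raw server response and one standing in for browser.page_source after JavaScript has run. Both strings and the extract_titles helper are hypothetical, and the class names gwt-Label and title are taken from the question's code, not verified against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: what the server might send before JavaScript runs.
STATIC_HTML = '<html><body><div id="content"></div></body></html>'

# Hypothetical markup: what browser.page_source might hold after rendering.
RENDERED_HTML = """
<html><body><div id="content">
  <div class="gwt-Label title">Dunkirk</div>
  <div class="gwt-Label title">mother!</div>
</div></body></html>
"""


def extract_titles(page_source):
    """Return the text of every div that has both classes gwt-Label and title."""
    soup = BeautifulSoup(page_source, 'html.parser')
    # A CSS selector matches the two classes regardless of their order
    # or of any extra classes on the element.
    return [div.get_text(strip=True) for div in soup.select('div.gwt-Label.title')]


print(extract_titles(STATIC_HTML))    # nothing to find in the raw response
print(extract_titles(RENDERED_HTML))  # titles are present after rendering
```

In the real script you would pass browser.page_source to the helper after logging in and opening the movie list page.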

The question is why use BeautifulSoup at all if Selenium has find_* functions to search the page.


EDIT: example which uses methods in Selenium and BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

#chromedriver = 'C:\\chromedriver.exe'
#browser = webdriver.Chrome(chromedriver)
browser = webdriver.Firefox()

browser.get("https://www.vudu.com/")
time.sleep(1)  # crude wait so JavaScript can add elements to the page

print('--- Selenium ---')

# Selenium 4 removed find_elements_by_css_selector(); use find_elements(By.CSS_SELECTOR, ...)
all_images = browser.find_elements(By.CSS_SELECTOR, '.border .gwt-Image')
for image in all_images[:5]: # first five elements
    #print('image:', image.get_attribute('src'))
    print('alt:', image.get_attribute('alt'))

print('--- BeautifulSoup ---')

# parse the HTML as rendered by the browser, including elements added by JavaScript
soup = BeautifulSoup(browser.page_source, 'html.parser')

all_images = soup.select('.border .gwt-Image')
for image in all_images[:5]: # first five elements
    #print('image:', image['src'])
    print('alt:', image['alt'])

browser.quit()  # close the browser when done

Result:

--- Selenium ---
alt: It (2017)
alt: American Made
alt: Dunkirk
alt: mother!
alt: The LEGO NINJAGO Movie
--- BeautifulSoup ---
alt: It (2017)
alt: American Made
alt: Dunkirk
alt: mother!
alt: The LEGO NINJAGO Movie
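
Since the question's goal was a CSV file, the collected names could then be written out with the standard csv module. This is only a sketch: the write_titles helper and the "title" column header are hypothetical, and the sample list reuses the names from the result above, not an actual personal movie list.

```python
import csv
import io


def write_titles(titles, fileobj):
    """Write one title per row to an open text file, preceded by a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(['title'])  # hypothetical column name
    for title in titles:
        writer.writerow([title])


# Sample data copied from the result above.
titles = ['It (2017)', 'American Made', 'Dunkirk', 'mother!', 'The LEGO NINJAGO Movie']

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_titles(titles, buffer)
print(buffer.getvalue())
```

To write a real file instead, pass a file opened with open('movies.csv', 'w', newline='', encoding='utf-8'), where movies.csv is a placeholder name.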

Upvotes: 1
