Anthony Ghanime

Reputation: 21

I'm having trouble scraping multiple URLs

I'm having trouble scraping multiple URLs. Essentially I'm able to run this for only one genre, but the second I include other links it stops working.

The goal is to get the data and place it into a csv file with the movie title, url, and genre. Any help would be appreciated!

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html,"html.parser")

containers = page_soup.findAll("li",{"class":"nm-content-horizontal-row-item"})


# name the output file to write to local disk
out_filename = "netflixaction2.csv"
# header of csv file to be written
headers = "Movie_Name, Movie_ID \n"

# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)



for container in containers:
    
    title_container = container.findAll("a",{"class":"nm-collections-title nm-collections-link"})
    title_container = title_container[0].text

    movieid = container.findAll("a",{"class":"nm-collections-title nm-collections-link"})
    movieid = movieid[0].attrs['href']

    print("Movie Name: " + title_container, "\n")
    print("Movie ID: " , movieid, "\n")

    f.write(title_container + ", " + movieid + "\n")
f.close()  # Close the file

Upvotes: 0

Views: 79

Answers (1)

PythonNewbie

Reputation: 1163

The reason you are getting the error is that you are trying to do a GET request on a list:

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

uClient = uReq(my_url)
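To see why this fails, here is a minimal reproduction. No network access is actually needed, because urlopen() rejects the list before any request is made (the example URLs below are placeholders):

```python
from urllib.request import urlopen

# urlopen() expects a single URL string (or a Request object),
# not a list, so this raises an exception immediately.
urls = ["https://example.com", "https://example.org"]

try:
    urlopen(urls)
    raised = False
except Exception as exc:
    raised = True
    print(type(exc).__name__)  # the call never reaches the network
```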

What I suggest here is to loop through each link:

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

for link in my_url:
    uClient = uReq(link)
    page_html = uClient.read()
    ....

Also note: if you just wrap your existing code in that loop, reopening the file with mode "w" on every iteration will overwrite everything written for the previous URL. You need to open the file once, before the loop, something like this:
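As a minimal sketch of that idea (the row written inside the loop is a placeholder standing in for the real fetch-and-parse code):

```python
# Open the output file once and write the header once, *before*
# looping over the URLs, so each URL's rows accumulate in the file.
urls = ["https://www.netflix.com/browse/genre/1365",
        "https://www.netflix.com/browse/genre/7424"]

with open("netflix_sketch.csv", "w", encoding="utf-8") as f:
    f.write("Movie_Name, Movie_ID\n")  # header written exactly once
    for link in urls:
        # ... fetch `link` and parse the containers here ...
        f.write("ExampleTitle, /title/000\n")  # placeholder row per movie
```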

New edit:

import csv

import requests
from bs4 import BeautifulSoup as soup

# All given URLS
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

# Create and open CSV file
with open("netflixaction2.csv", 'w', encoding='utf-8') as csv_file:
    # Headers for CSV
    headers_for_csv = ['Movie Name', 'Movie Link']

    # DictWriter maps each dict's keys to the CSV columns
    csv_writer = csv.DictWriter(csv_file, delimiter=',', lineterminator='\n', fieldnames=headers_for_csv)
    csv_writer.writeheader()

    # We need to loop through each URL from the list
    for link in my_url:

        # Do a simple GET requests with the URL
        response = requests.get(link)

        page_soup = soup(response.text, "html.parser")

        # Find all nm-content-horizontal-row-item
        containers = page_soup.findAll("li", {"class": "nm-content-horizontal-row-item"})

        # Loop through each found "li"
        for container in containers:
            movie_name = container.text.strip()
            movie_link = container.find("a")['href']

            print(f"Movie Name: {movie_name} | Movie link: {movie_link}")

            # Write to CSV
            csv_writer.writerow({
                'Movie Name': movie_name,
                'Movie Link': movie_link,
            })

The with statement closes the file automatically, so there is no need to call close() yourself.

That should be your solution :) Feel free to comment if I'm missing something!

Upvotes: 1
