Reputation: 21
I'm having trouble scraping multiple URLs. Essentially I'm able to run this for only one genre, but the second I include other links it stops working.
The goal is to get the data and place it into a CSV file with the movie title, URL, and genre. Any help would be appreciated!
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("li", {"class": "nm-content-horizontal-row-item"})

# name the output file to write to local disk
out_filename = "netflixaction2.csv"
# header of csv file to be written
headers = "Movie_Name, Movie_ID \n"

# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)

for container in containers:
    title_container = container.findAll("a", {"class": "nm-collections-title nm-collections-link"})
    title_container = title_container[0].text

    movieid = container.findAll("a", {"class": "nm-collections-title nm-collections-link"})
    movieid = movieid[0].attrs['href']

    print("Movie Name: " + title_container, "\n")
    print("Movie ID: ", movieid, "\n")

    f.write(title_container + ", " + movieid + "\n")

f.close()  # Close the file
Upvotes: 0
Views: 79
Reputation: 1163
The reason you are getting the error is that you are trying to do a GET request on a list:
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']
uClient = uReq(my_url)
What I suggest here is to loop through each link, e.g.:
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

for link in my_url:
    uClient = uReq(link)
    page_html = uClient.read()
    ....
And to mention: if you just wrap your existing code in that loop, reopening the file with mode "w" on every iteration will overwrite whatever the previous iteration wrote. You need to open the file once, before the loop, something like this:
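If you want to keep your original urllib approach, here is a minimal sketch of that structure (it reuses the selectors from your code; I haven't run it against the live pages, so treat it as a sketch):

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

# open the file ONCE, before the loop, so every genre appends to the same file
f = open("netflixaction2.csv", "w")
f.write("Movie_Name, Movie_ID \n")

for link in my_url:
    uClient = uReq(link)          # one GET request per URL, not per list
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("li", {"class": "nm-content-horizontal-row-item"})

    for container in containers:
        anchor = container.findAll("a", {"class": "nm-collections-title nm-collections-link"})
        f.write(anchor[0].text + ", " + anchor[0].attrs['href'] + "\n")

f.close()  # close once, after all links are processed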
New edit, a fuller version using requests and csv.DictWriter:
import csv
import requests
from bs4 import BeautifulSoup as soup

# All given URLs
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

# Create and open CSV file
with open("netflixaction2.csv", 'w', encoding='utf-8') as csv_file:
    # Headers for CSV
    headers_for_csv = ['Movie Name', 'Movie Link']

    # Set up the csv DictWriter
    csv_writer = csv.DictWriter(csv_file, delimiter=',', lineterminator='\n', fieldnames=headers_for_csv)
    csv_writer.writeheader()

    # We need to loop through each URL from the list
    for link in my_url:
        # Do a simple GET request with the URL
        response = requests.get(link)
        page_soup = soup(response.text, "html.parser")

        # Find all nm-content-horizontal-row-item
        containers = page_soup.findAll("li", {"class": "nm-content-horizontal-row-item"})

        # Loop through each found "li"
        for container in containers:
            movie_name = container.text.strip()
            movie_link = container.find("a")['href']
            print(f"Movie Name: {movie_name} | Movie link: {movie_link}")

            # Write to CSV
            csv_writer.writerow({
                'Movie Name': movie_name,
                'Movie Link': movie_link,
            })

# No explicit close needed: the with block closes the file automatically
That should be your solution :) Feel free to comment if I'm missing something!
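One defensive tweak worth considering (my addition, not something the original answer includes): container.find("a") returns None when an li has no anchor tag, and None['href'] raises a TypeError, so you can skip such items. A drop-in replacement for the inner loop above:

for container in containers:
    anchor = container.find("a")
    if anchor is None:  # skip list items that have no link
        continue
    movie_name = container.text.strip()
    movie_link = anchor['href']
    csv_writer.writerow({'Movie Name': movie_name, 'Movie Link': movie_link})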
Upvotes: 1