Reputation: 1059
I have a list of the 5000 best movies, spanning 50 pages, at http://5000best.com/movies/. I want to extract the names of all 5000 movies, then follow each movie's link, which redirects to its IMDb page, and extract the director's name from there. This will give me a table with 5000 rows and two columns: movie name and director. The data will then be exported to CSV or XLSX.
I have the following for extracting text so far:
import requests
import bs4

start_url = 'http://5000best.com/movies/'
r = requests.get(start_url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
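For the director step described above, one possible approach is to read the JSON-LD metadata block that IMDb title pages embed in a script tag; this is a sketch under the assumption that the page markup includes such a block with a director entry, illustrated with an inline sample instead of a real fetch:

```python
import json
import bs4

def director_from_imdb_html(html):
    """Pull the director's name from an IMDb title page, assuming the
    page embeds JSON-LD metadata containing a 'director' entry."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    tag = soup.find('script', type='application/ld+json')
    if tag is None:
        return None
    data = json.loads(tag.string)
    director = data.get('director')
    # IMDb lists one or more directors; take the first if it is a list.
    if isinstance(director, list) and director:
        director = director[0]
    return director.get('name') if isinstance(director, dict) else None

# Hypothetical sample standing in for a fetched IMDb page.
sample = '''<html><head><script type="application/ld+json">
{"@type": "Movie", "name": "Casablanca",
 "director": [{"@type": "Person", "name": "Michael Curtiz"}]}
</script></head></html>'''
print(director_from_imdb_html(sample))  # → Michael Curtiz
```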
Upvotes: 0
Views: 247
Reputation: 625
I think the issue is getting the pagination link. This is how the link works:
http://5000best.com/?m.c&xml=1&ta=13&p=1&s=&sortby=0&y0=&y1=&ise=&h=01000000000000000
Two parameters change with each page: p and h (although the links seem to work regardless of the h parameter). So the link for page 2 looks like this:
http://5000best.com/?m.c&xml=1&ta=13&p=2&s=&sortby=0&y0=&y1=&ise=&h=02000000000000000
and the one for page 50 like this:
http://5000best.com/?m.c&xml=1&ta=13&p=50&s=&sortby=0&y0=&y1=&ise=&h=05000000000000000
Hope you can handle the rest.
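If only p matters, the 50 page URLs can be generated from the pattern above with a format string; this is a sketch that keeps h fixed at its page-1 value, on the assumption the server ignores it:

```python
# Build the 50 pagination URLs from the observed pattern.
# Assumption: only the `p` parameter matters; `h` is kept constant.
base = ('http://5000best.com/?m.c&xml=1&ta=13&p={page}'
        '&s=&sortby=0&y0=&y1=&ise=&h=01000000000000000')
page_urls = [base.format(page=p) for p in range(1, 51)]
print(page_urls[1])  # the page-2 URL, with p=2
```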
Upvotes: 1
Reputation: 6223
OK, here is the main logic for the pagination. Hope you can take it from there. To capture all pages, just loop until the next page doesn't exist.
import requests
import bs4

i = 1
while True:
    url = f'http://5000best.com/movies/{i}'
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    # looking at the HTML we can find the main table
    table = soup.find('table', id="ttable")
    # if the table is missing or empty, we are beyond the last page
    if table is None or not table.find_all('tr'):
        break
    # analyse the HTML and process the table rows here
    i += 1
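Processing the table rows and the CSV export the question asks for might then look like this; the two-column row layout inside the ttable element is an assumption, illustrated with an inline sample instead of a real fetch:

```python
import bs4
import csv

def rows_from_table(html):
    """Parse (movie, director) pairs from the page's ttable element.
    The two-column layout is an assumption about the markup."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    table = soup.find('table', id='ttable')
    pairs = []
    if table is not None:
        for tr in table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if len(cells) >= 2:
                pairs.append((cells[0], cells[1]))
    return pairs

# Hypothetical sample standing in for one fetched page.
sample = """
<table id="ttable">
  <tr><td>Casablanca</td><td>Michael Curtiz</td></tr>
</table>
"""
rows = rows_from_table(sample)

with open('movies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['movie', 'director'])
    writer.writerows(rows)
```

Collecting the rows from all 50 pages into one list before writing gives the single 5000-row CSV; openpyxl or pandas can produce the XLSX variant the same way.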
Upvotes: 1