Reputation: 1059
I have a list of the 5000 best movies, spanning 50 pages, at http://5000best.com/movies/. I want to extract the names of all 5000 movies, then follow each movie's link, which redirects to its IMDb page, and extract the director's name from there. This will give me a table with 5000 rows and two columns: movie name and director. The data will then be exported to CSV or XLSX.
I have the following for extracting text so far:
import requests
import bs4

start_url = 'http://5000best.com/movies/'
r = requests.get(start_url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
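For the director step described above, one possible approach is to read the JSON-LD metadata block that IMDb title pages embed in a script tag; this is a sketch under the assumption that the page markup includes such a block with a director entry, illustrated with an inline sample instead of a real fetch:

```python
import json
import bs4

def director_from_imdb_html(html):
    """Pull the director's name from an IMDb title page, assuming the
    page embeds JSON-LD metadata containing a 'director' entry."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    tag = soup.find('script', type='application/ld+json')
    if tag is None:
        return None
    data = json.loads(tag.string)
    director = data.get('director')
    # IMDb lists one or more directors; take the first if it is a list.
    if isinstance(director, list) and director:
        director = director[0]
    return director.get('name') if isinstance(director, dict) else None

# Hypothetical sample standing in for a fetched IMDb page.
sample = '''<html><head><script type="application/ld+json">
{"@type": "Movie", "name": "Casablanca",
 "director": [{"@type": "Person", "name": "Michael Curtiz"}]}
</script></head></html>'''
print(director_from_imdb_html(sample))  # → Michael Curtiz
```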
Upvotes: 0
Views: 247
Reputation: 625
I think the issue is getting the pagination link. This is how the link works:
http://5000best.com/?m.c&xml=1&ta=13&p=1&s=&sortby=0&y0=&y1=&ise=&h=01000000000000000
Two parameters change with each page: p and h (although the links seem to work regardless of the h parameter). So the link for page 2 looks like this:
http://5000best.com/?m.c&xml=1&ta=13&p=2&s=&sortby=0&y0=&y1=&ise=&h=02000000000000000
and the one for page 50 like this:
http://5000best.com/?m.c&xml=1&ta=13&p=50&s=&sortby=0&y0=&y1=&ise=&h=05000000000000000
Hope you can handle the rest.
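If only p matters, the 50 page URLs can be generated from the pattern above with a format string; this is a sketch that keeps h fixed at its page-1 value, on the assumption the server ignores it:

```python
# Build the 50 pagination URLs from the observed pattern.
# Assumption: only the `p` parameter matters; `h` is kept constant.
base = ('http://5000best.com/?m.c&xml=1&ta=13&p={page}'
        '&s=&sortby=0&y0=&y1=&ise=&h=01000000000000000')
page_urls = [base.format(page=p) for p in range(1, 51)]
print(page_urls[1])  # the page-2 URL, with p=2
```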
Upvotes: 1
Reputation: 6223
OK, here is the main logic for the pagination. Hope you can take it from there. To capture all pages, just loop until the next page doesn't exist.
import requests
import bs4

i = 1
while True:
    url = f'http://5000best.com/movies/{i}'
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    # looking at the HTML we can find the main table
    table = soup.find('table', id="ttable")
    # if the table is missing or empty, we are beyond the last page
    if table is None or not table.find_all('tr'):
        break
    # analyse the HTML and process the table rows here
    i += 1
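Processing the table rows and the CSV export the question asks for might then look like this; the two-column row layout inside the ttable element is an assumption, illustrated with an inline sample instead of a real fetch:

```python
import bs4
import csv

def rows_from_table(html):
    """Parse (movie, director) pairs from the page's ttable element.
    The two-column layout is an assumption about the markup."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    table = soup.find('table', id='ttable')
    pairs = []
    if table is not None:
        for tr in table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if len(cells) >= 2:
                pairs.append((cells[0], cells[1]))
    return pairs

# Hypothetical sample standing in for one fetched page.
sample = """
<table id="ttable">
  <tr><td>Casablanca</td><td>Michael Curtiz</td></tr>
</table>
"""
rows = rows_from_table(sample)

with open('movies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['movie', 'director'])
    writer.writerows(rows)
```

Collecting the rows from all 50 pages into one list before writing gives the single 5000-row CSV; openpyxl or pandas can produce the XLSX variant the same way.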
Upvotes: 1