JodeCharger100

Reputation: 1059

Scraping a table for links, getting data from each link, and doing the same for many pages (pagination)

I have a list of 5000 best movies, spanning 50 pages. The website is

http://5000best.com/movies/

I want to extract the names of all 5000 movies and then follow each movie's link, which redirects to its IMDb page. From there I want to extract the director's name. The result should be a table of 5000 rows with two columns, movie name and director, exported to CSV or XLSX.

This is what I have so far for fetching and parsing the page:

import requests
import bs4

start_url = 'http://5000best.com/movies/'
r = requests.get(start_url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
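
For the IMDb step, I imagine something like this. It is only a sketch: it assumes IMDb still embeds page metadata as JSON-LD in a script tag and that requests without a browser-like User-Agent may be rejected, so both points need checking.

import csv
import json

import bs4
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # IMDb tends to reject bare clients (assumption)

def get_director(imdb_url):
    # Fetch an IMDb title page and read the director from its JSON-LD metadata.
    # Assumes a <script type="application/ld+json"> block is present; if IMDb
    # changes its markup, the selector below needs adjusting.
    r = requests.get(imdb_url, headers=HEADERS)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    data = json.loads(soup.find('script', type='application/ld+json').string)
    directors = data.get('director', [])
    if isinstance(directors, dict):  # a single director sometimes comes as one object
        directors = [directors]
    return ', '.join(d['name'] for d in directors)

rows = []  # to be filled with (movie_name, director) pairs from the scraped table
with open('movies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['movie', 'director'])
    writer.writerows(rows)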

Upvotes: 0

Views: 247

Answers (2)

Pratheesh Russell

Reputation: 625

I think the issue is getting the pagination link. This is how the link works:

http://5000best.com/?m.c&xml=1&ta=13&p=1&s=&sortby=0&y0=&y1=&ise=&h=01000000000000000

There are two parameters that change with each page: p and h (although the links seem to work regardless of the h parameter).

So the link for page 2 looks like this:

http://5000best.com/?m.c&xml=1&ta=13&p=2&s=&sortby=0&y0=&y1=&ise=&h=02000000000000000

and the one for page 50 like this:

http://5000best.com/?m.c&xml=1&ta=13&p=50&s=&sortby=0&y0=&y1=&ise=&h=05000000000000000
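
If the h parameter really is ignored, the page URLs can be generated from p alone; a minimal sketch, assuming h can simply be dropped:

# Build all 50 page URLs, varying only the p parameter.
# Assumes the h parameter can be dropped entirely, which seems to hold in practice.
base = 'http://5000best.com/?m.c&xml=1&ta=13&p={}&s=&sortby=0&y0=&y1=&ise='
page_urls = [base.format(p) for p in range(1, 51)]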

Hope you can handle the rest.

Upvotes: 1

ascripter

Reputation: 6223

OK, here is the main logic for the pagination; hope you can take it from there. To capture all pages, just loop until the next page no longer exists.

import requests
import bs4

i = 1
while True:
    url = f'http://5000best.com/movies/{i}'
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')

    # looking at the HTML we can find the main table
    table = soup.find('table', id="ttable")

    # analyse the HTML and process the table here

    # if the table is missing or empty, we are beyond the last page
    if table is None or not table.find_all('tr'):
        break
    i += 1
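
To fill in the processing step, a rough sketch of what the loop body could do, assuming each row's first anchor is the movie-title link (inspect the actual markup to confirm):

# hypothetical body for the "process the table" step above
for row in table.find_all('tr'):
    link = row.find('a')  # assumes the first anchor in the row is the title link
    if link is None:
        continue
    title = link.get_text(strip=True)
    href = link.get('href')  # may be relative; prefix the site root if needed
    print(title, href)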

Upvotes: 1
