JodeCharger100

Reputation: 1059

Table scraping and pagination in steps with BeautifulSoup

I am trying to scrape this website with the BeautifulSoup package. Using pointers from this solution, I can scrape a single page successfully, but I am now trying to add pagination.

import pandas as pd
import requests
from bs4 import BeautifulSoup
    
for num in range(0, 800,80):
    url = 'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000018349&type=&dateb=&owner=include&start='+ str(num)
    r = requests.get(url)
    html = r.text

    soup = BeautifulSoup(html)
    table = soup.find('table', id="transaction-report")
    rows = table.find_all('tr')
    data = []
    final = []
    for row in rows[1:]:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    final = final.append(data)

result = pd.DataFrame(final, columns=['A or D', 'Date', 'Reporting Owner', 'Form', 'Transaction Type', 
                                     'Ownership D or I', 'Number of Securities Transacted', 'Number of Securities Owned',
                                     'Line Number', 'Owner CIK', 'Security Name'])

print(result)

The start parameter in the URL increases in increments of 80 from page to page. However, I cannot combine the pages into a single DataFrame. I tried creating a list called final and appending each page's data to it, but that did not work.

Upvotes: 0

Views: 172

Answers (1)

Alpesh kabra

Reputation: 91

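You have to create the final list once, before the loop; otherwise it is reset to an empty list on every page. In addition, list.append returns None, so final = final.append(data) replaces the list with None. Call final.extend(data) instead, which adds each page's rows to the same flat list.

import pandas as pd
import requests
from bs4 import BeautifulSoup

final = []  # created once, so rows from every page accumulate here
for num in range(0, 800, 80):
    url = 'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000018349&type=&dateb=&owner=include&start=' + str(num)
    r = requests.get(url)
    html = r.text

    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', id="transaction-report")
    rows = table.find_all('tr')
    data = []
    for row in rows[1:]:  # skip the header row
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    # extend adds this page's rows in place; do not reassign its return value (None)
    final.extend(data)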

result = pd.DataFrame(final, columns=['A or D', 'Date', 'Reporting Owner', 'Form', 'Transaction Type', 
                                     'Ownership D or I', 'Number of Securities Transacted', 'Number of Securities Owned',
                                     'Line Number', 'Owner CIK', 'Security Name'])

print(result)
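
A note on the accumulation step, since it is easy to trip over: list.append and list.extend both modify the list in place and return None, so assigning their result back to the variable discards the list. The sketch below shows the pattern in isolation, without any network access. The collect_rows helper and the sample pages data are hypothetical stand-ins for the rows scraped from each page, not part of the original code.

```python
def collect_rows(pages):
    """Flatten a sequence of per-page row lists into one list of rows."""
    final = []              # created once, before the loop
    for data in pages:
        final.extend(data)  # mutates final in place; the return value is None
    return final


# Hypothetical stand-in for the rows parsed from two pages
pages = [
    [["A", "2020-01-02"], ["D", "2020-01-03"]],
    [["A", "2020-02-04"]],
]

rows = collect_rows(pages)
print(len(rows))              # 3

# append returns None, which is why `final = final.append(data)` breaks
print([].append(1) is None)   # True
```

Using extend rather than append also keeps final as a flat list of rows, which is the shape pd.DataFrame expects when column names are passed; append would instead nest each page's rows as a single element.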

Upvotes: 1
