Reputation: 1059
I am attempting to scrape this website using the BeautifulSoup package. I have successfully scraped a single page, using pointers from this solution, but I am now trying to handle pagination.
import pandas as pd
import requests
from bs4 import BeautifulSoup

for num in range(0, 800, 80):
    url = 'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000018349&type=&dateb=&owner=include&start=' + str(num)
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html)
    table = soup.find('table', id="transaction-report")
    rows = table.find_all('tr')
    data = []
    final = []
    for row in rows[1:]:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    final = final.append(data)

result = pd.DataFrame(final, columns=['A or D', 'Date', 'Reporting Owner', 'Form', 'Transaction Type',
                                      'Ownership D or I', 'Number of Securities Transacted', 'Number of Securities Owned',
                                      'Line Number', 'Owner CIK', 'Security Name'])
print(result)
The pages increase in increments of 80. However, I am unable to combine the pages into a single DataFrame. I tried to create a list called final to collect the data from each page, but it does not work.
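To make the paging concrete, here is a minimal sketch of the offsets and URLs the loop visits. It only builds the strings and makes no requests; the names BASE_URL, PARAMS and page_urls are illustrative and not part of my original script.

```python
from urllib.parse import urlencode

# Query parameters copied from the URL above; only 'start' changes per page.
BASE_URL = 'https://www.sec.gov/cgi-bin/own-disp'
PARAMS = {
    'action': 'getissuer',
    'CIK': '0000018349',
    'type': '',
    'dateb': '',
    'owner': 'include',
}

def page_urls(first=0, stop=800, step=80):
    """Return one URL per page, with 'start' advancing by `step` rows."""
    urls = []
    for start in range(first, stop, step):
        query = urlencode({**PARAMS, 'start': start})
        urls.append(BASE_URL + '?' + query)
    return urls

urls = page_urls()
```

Each URL differs only in its trailing start value (0, 80, 160, and so on up to 720).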
Upvotes: 0
Views: 172
Reputation: 91
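You have to create the final list once, before the loop; inside the loop it is reset to an empty list on every page. You also need final.extend(data) instead of final = final.append(data): list.append returns None, so that assignment replaces the list with None, and append would add each page as one nested element instead of adding its rows.
import pandas as pd
import requests
from bs4 import BeautifulSoup

final = []  # created once, so rows from every page accumulate here

for num in range(0, 800, 80):
    url = 'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000018349&type=&dateb=&owner=include&start=' + str(num)
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', id="transaction-report")
    rows = table.find_all('tr')
    data = []  # rows for the current page only
    for row in rows[1:]:  # skip the header row
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    final.extend(data)  # add this page's rows; do not reassign final

result = pd.DataFrame(final, columns=['A or D', 'Date', 'Reporting Owner', 'Form', 'Transaction Type',
                                      'Ownership D or I', 'Number of Securities Transacted', 'Number of Securities Owned',
                                      'Line Number', 'Owner CIK', 'Security Name'])
print(result)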
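To see the accumulation pattern on its own, without any network access, here is a minimal sketch. The helper collect_rows and the sample rows in page_1 and page_2 are hypothetical stand-ins for the parsed table rows, not data from the SEC site.

```python
def collect_rows(pages):
    """Flatten a sequence of per-page row lists into one list of rows."""
    final = []  # created once, before the loop
    for page_rows in pages:
        # extend adds each row individually and modifies the list in place
        final.extend(page_rows)
    return final

# Hypothetical rows standing in for two scraped pages.
page_1 = [["A", "2020-01-02"], ["D", "2020-01-03"]]
page_2 = [["A", "2020-02-04"]]

rows = collect_rows([page_1, page_2])

# For contrast: append adds the whole page as a single nested element,
# and its return value is None, so assigning it back loses the list.
nested = []
append_result = nested.append(page_1)
```

Here rows holds three flat rows, nested holds a single element (the whole first page), and append_result is None, which is why reassigning final to the result of append breaks the loop on the second page.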
Upvotes: 1