Reputation: 175
I am trying to scrape IB website. So, what I am doing, I have created the urls to iterate over, and I am able to extract the required information, but seems the dataframe keeps being overwritten vs appending.
import pandas as pd
from pandas import DataFrame as df
from bs4 import BeautifulSoup
import csv
import requests
base_url = "https://www.interactivebrokers.com/en/index.phpf=2222&exch=mexi&showcategories=STK&p=&cc=&limit=100"
n = 1
url_list = []
while n <= 2:
url = (base_url + "&page=%d" % n)
url_list.append(url)
n = n+1
def parse_websites(url_list):
for url in url_list:
html_string = requests.get(url)
soup = BeautifulSoup(html_string.text, 'lxml') # Parse the HTML as a string
table = soup.find('div',{'class':'table-responsive no-margin'}) #Grab the first table
df = pd.DataFrame(columns=range(0,4), index = [0]) # I know the size
for row_marker, row in enumerate(table.find_all('tr')):
column_marker = 0
columns = row.find_all('td')
try:
df.loc[row_marker] = [column.get_text() for column in columns]
except ValueError:
# It's a safe way when [column.get_text() for column in columns] is empty list.
continue
print(df)
df.to_csv('path_to_file\\test1.csv')
parse_websites(url_list)
Can you please take a look at my code at advise what I am doing wrong ?
Upvotes: 0
Views: 1887
Reputation: 949
If you define a data structure from within a loop, each iteration of the loop will redefine the data structure, meaning that the work is being rewritten.
The dataframe should be defined outside of the loop if you do not want it to be overwritten.
Upvotes: 1
Reputation: 3242
One solution if you want to append the data frames on the file is to write in append mode:
df.to_csv('path_to_file\\test1.csv', mode='a', header=False)
otherwise you should create the data frame outside as mentioned in the comments.
Upvotes: 3