Reputation: 217
Below is a scraper that loops through two websites, scrapes each team's roster information, puts the information into arrays, and exports the arrays to a CSV file. Everything works, except that the header row written by writerow repeats in the CSV file every time the scraper moves on to the second website. Is it possible to adjust the CSV portion of the code so that the headers appear only once when the scraper loops through multiple websites? Thanks in advance!
import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

for team in team_list:
    page = requests.get('http://m.{}.mlb.com/roster/'.format(team))
    soup = BeautifulSoup(page.text, 'html.parser')
    soup.find(class_='nav-tabset-container').decompose()
    soup.find(class_='column secondary span-5 right').decompose()
    roster = soup.find(class_='layout layout-roster')
    names = [n.contents[0] for n in roster.find_all('a')]
    ids = [n['href'].split('/')[2] for n in roster.find_all('a')]
    number = [n.contents[0] for n in roster.find_all('td', index='0')]
    handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
    height = [n.contents[0] for n in roster.find_all('td', index='4')]
    weight = [n.contents[0] for n in roster.find_all('td', index='5')]
    DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
    team = [soup.find('meta',property='og:site_name')['content']] * len(names)

    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))
Upvotes: 4
Views: 5276
Reputation: 26315
Just write the header before the loop, and put the loop inside the with context manager:
import requests
import csv
from bs4 import BeautifulSoup

team_list = {'yankees', 'redsox'}
headers = ['Name', 'ID', 'Number', 'Hand', 'Height', 'Weight', 'DOB', 'Team']

# 1. wrap everything in context manager
with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
    f = csv.writer(fp)

    # 2. write headers before anything else
    f.writerow(headers)

    # 3. now process the loop
    for team in team_list:
        # Do everything else...
You could also define your headers outside the loop, just like team_list, which leads to cleaner code.
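As a rough, self-contained sketch of the pattern above: the header is written once, right after the writer is created, and the loop only appends data rows. The names `fetch_rows` and `write_rosters` are hypothetical, and `fetch_rows` returns placeholder data in place of the real requests/BeautifulSoup step so the example runs without network access.

```python
import csv
import io

HEADERS = ['Name', 'ID', 'Number', 'Hand', 'Height', 'Weight', 'DOB', 'Team']


def fetch_rows(team):
    """Hypothetical stand-in for the scraping step; returns placeholder rows."""
    return [
        ('Player A', '1001', '11', 'R/R', '6-1', '200', '1990-01-01', team),
        ('Player B', '1002', '22', 'L/L', '6-3', '215', '1991-02-02', team),
    ]


def write_rosters(fp, teams):
    """Write the header once, then one batch of rows per team."""
    writer = csv.writer(fp)
    writer.writerow(HEADERS)  # written before the loop, so it appears once
    for team in teams:
        writer.writerows(fetch_rows(team))


# Demo against an in-memory buffer instead of a real file.
buffer = io.StringIO(newline='')
write_rosters(buffer, ['yankees', 'redsox'])
lines = buffer.getvalue().splitlines()
print(lines[0])    # the header row
print(len(lines))  # one header plus two rows for each of the two teams
```

With a real file, the same function could be called inside `with open('MLB_Active_Roster.csv', 'w', newline='') as fp:`.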
Upvotes: 1
Reputation: 31
Another method is to simply write the header before the for loop, so you do not have to check whether it has already been written.
import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

# 'w' truncates the file, so every run starts with a single header row
with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])

for team in team_list:
    # do your bs4 and parsing stuff here (builds names, ids, number, ...)
    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))
You can also open the file just once instead of three times:
import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])
    for team in team_list:
        # do your bs4 and parsing stuff here (builds names, ids, number, ...)
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))
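A related option, not used in the answers here, is `csv.DictWriter`, whose `writeheader()` method writes the field names as a single header row. The sketch below assumes the scraped columns are parallel lists as in the question; the values are placeholders, and the helper name `write_dict_rows` is made up for illustration.

```python
import csv
import io

FIELDS = ['Name', 'ID', 'Number', 'Hand', 'Height', 'Weight', 'DOB', 'Team']


def write_dict_rows(fp, rows):
    """Write one header row via writeheader(), then the data rows."""
    writer = csv.DictWriter(fp, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder parallel lists standing in for the scraped columns.
names = ['Player A', 'Player B']
ids = ['1001', '1002']
number = ['11', '22']
handedness = ['R/R', 'L/L']
height = ['6-1', '6-3']
weight = ['200', '215']
DOB = ['1990-01-01', '1991-02-02']
team = ['yankees'] * len(names)

# Turn each zipped tuple into a dict keyed by column name.
rows = [dict(zip(FIELDS, values))
        for values in zip(names, ids, number, handedness, height, weight, DOB, team)]

dict_buffer = io.StringIO(newline='')
write_dict_rows(dict_buffer, rows)
dict_lines = dict_buffer.getvalue().splitlines()
print(dict_lines[0])  # the header row
```

Keying rows by column name keeps the header and the data in the same order even if the column list changes later.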
Upvotes: 2
Reputation: 5942
Using a flag variable to track whether the header has been written may be helpful. Once the header has been added, it will not be written a second time.
header_added = False

for team in team_list:
    # do some stuff (scrape and build names, ids, number, ...)
    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        if not header_added:
            f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])
            header_added = True
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))
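One caveat with an in-memory flag and append mode is that the flag resets on every run, so running the script again would append another header to the existing file. A possible variation, sketched here with a hypothetical helper `append_rows` and placeholder rows, is to decide based on whether the file is missing or empty.

```python
import csv
import os
import tempfile

HEADERS = ['Name', 'ID', 'Number', 'Hand', 'Height', 'Weight', 'DOB', 'Team']


def append_rows(path, rows):
    """Append rows to path, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as fp:
        writer = csv.writer(fp)
        if need_header:
            writer.writerow(HEADERS)
        writer.writerows(rows)


# Demo in a temporary directory; two separate appends simulate two teams (or two runs).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'MLB_Active_Roster.csv')
    append_rows(path, [('Player A', '1001', '11', 'R/R', '6-1', '200', '1990-01-01', 'yankees')])
    append_rows(path, [('Player B', '1002', '22', 'L/L', '6-3', '215', '1991-02-02', 'redsox')])
    with open(path, newline='') as fp:
        demo_rows = list(csv.reader(fp))

print(demo_rows[0])    # the header row
print(len(demo_rows))  # one header plus two data rows
```

This keeps the append-mode behaviour from the question while leaving existing data and its header untouched on later runs.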
Upvotes: 3