Reputation: 165
I am trying to help out our soccer coach who is doing some work on helping underprivileged kids get recruited. I am trying to scrape a "topdrawer" website page so we can track where players get placed. I am not a python expert at all and am banging my head against the wall. I got some help yesterday and tried to implement - see two sets of code below. Neither puts the data into a nice table we can sort and analyze etc. Thanks in advance for any help.
import bs4 as bs
import urllib.request
import pandas as pd
import csv

max_page_num = 14
max_page_dig = 1  # number of digits in the page number

with open('result.csv', "w", newline='') as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    df = pd.read_html(source)
    df = pd.DataFrame(df)
    df.to_csv('results.csv', header=False, index=False, mode='a')  # 'a' should append each table to the csv file, instead of overwriting it.
The second method jumbles the output up into one line with \n separators etc.
import bs4 as bs
import urllib.request
import pandas as pd
import csv

max_page_num = 14
max_page_dig = 1  # number of digits in the page number

with open('result.csv', "w", newline='') as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc
    print(page_num)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    print(source)
    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    #table = soup.table
    table_rows = table.find_all('tr')
    with open('result.csv', 'a', newline='') as f:
        for tr in table_rows:
            td = tr.find_all('td')
            row = [i.text for i in td]
            f.write(str(row))
In the first version the data is all placed on one line and not separated. The second version puts each page into one cell and splits the pages in half.
Upvotes: 0
Views: 109
Reputation: 143187
A page may have many <table> elements in its HTML (tables are sometimes used to create menus or to organize elements on the page). pandas.read_html() creates a DataFrame for every <table> on the page and always returns a list with all of the created DataFrames (even if there was only one <table>), so you have to check which one has your data. You can display every DataFrame from the list to see which one you need. This way I know that the first DataFrame has your data, so you have to use [0] to get it.
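To see what that means in practice, here is a small self-contained sketch. The HTML string below is only a stand-in for the real page (which may have extra layout/menu tables); pd.read_html() still returns one DataFrame per <table>, and you inspect each one to find the player data:

```python
import io
import pandas as pd

# Hypothetical HTML standing in for the real page: a layout table
# followed by the table that actually holds the data.
html = """
<table><tr><td>Home</td><td>About</td></tr></table>
<table>
  <tr><th>Name</th><th>State</th></tr>
  <tr><td>John Doe</td><td>CA</td></tr>
</table>
"""

all_tables = pd.read_html(io.StringIO(html))  # always a list
print('tables found:', len(all_tables))

# Inspect each DataFrame to decide which index holds your data.
for i, df in enumerate(all_tables):
    print(i, df.shape)
```

On the real page the same loop tells you that index [0] is the one to keep.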
import pandas as pd

max_page_num = 15  # it has to be 15 instead of 14 because `range(15)` will give `0-14`

with open('result.csv', 'w', newline='') as f:
    f.write('Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment\n')

for i in range(max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    all_tables = pd.read_html(source)
    df = all_tables[0]
    print('items:', len(df))
    df.to_csv('result.csv', header=False, index=False, mode='a')  # 'a' appends each table to the csv file instead of overwriting it (note: same filename as the header above)
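Once the file exists, it can be loaded back into pandas for the sorting and analysis mentioned in the question. A minimal sketch, using a hypothetical in-memory sample in place of the real result.csv:

```python
import io
import pandas as pd

# Hypothetical sample standing in for the result.csv written by the loop.
csv_data = io.StringIO(
    "Name,Gender,State,Position,Grad,Club/HS,Rating,Commitment\n"
    "Jane Roe,F,TX,Forward,2020,Example FC,4,State U\n"
    "John Doe,M,CA,Keeper,2020,Sample SC,5,Tech U\n"
)

df = pd.read_csv(csv_data)
# Sort best-rated players first, then alphabetically by name.
df = df.sort_values(['Rating', 'Name'], ascending=[False, True])
print(df)
```

With the real file you would call pd.read_csv('result.csv') instead of the StringIO sample.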
EDIT:

In the second version you should use strip() to remove the \n, which csv would treat as the beginning of a new row. You shouldn't use str(row) because it creates a string with [ ], which is not correct in a csv file. You should rather use ",".join(row) to create the string. And you have to add \n at the end of every row because write() doesn't add it.

But it would be better to use the csv module and its writerow() for this. It will convert the list to a string with , as separator and add \n automatically. If some item has a , or \n in it, then it will put it in " " to create a correct row.
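A small sketch of that difference, using a made-up row whose last field contains a comma:

```python
import csv
import io

row = ["John Doe", "CA", "Forward", "State U, Main Campus"]  # last field has a comma

# Wrong: str(row) keeps the Python list syntax with [ ] and quotes.
print(str(row))

# Right: csv.writer adds the , separators, quotes the field containing
# the comma, and terminates the row with a line ending automatically.
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())

# The quoted row parses back into the same four fields.
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
print(parsed == row)  # True
```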
import bs4 as bs
import urllib.request
import csv

max_page_num = 15

fh = open('result.csv', "w", newline='')
csv_writer = csv.writer(fh)
csv_writer.writerow(["Name", "Gender", "State", "Position", "Grad", "Club/HS", "Rating", "Commitment"])

for i in range(max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        #row = [i.text.strip() for i in td]  # strip to remove spaces and '\n'
        row = [i.get_text(strip=True) for i in td]  # strip to remove spaces and '\n'
        if row:  # check if row is not empty
            #print(row)
            csv_writer.writerow(row)

fh.close()
Upvotes: 1