Reputation: 165
I am trying to help out our soccer coach who is doing some work on helping underprivileged kids get recruited. I am trying to scrape a "topdrawer" website page so we can track where players get placed. I am not a python expert at all and am banging my head against the wall. I got some help yesterday and tried to implement - see two sets of code below. Neither puts the data into a nice table we can sort and analyze etc. Thanks in advance for any help.
import bs4 as bs
import urllib.request
import pandas as pd
import csv

max_page_num = 14
max_page_dig = 1  # number of digits in the page number

with open('result.csv', "w", newline='') as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    df = pd.read_html(source)
    df = pd.DataFrame(df)
    df.to_csv('results.csv', header=False, index=False, mode='a')  # 'a' should append each table to the csv file, instead of overwriting it.
The second method jumbles the output up into one line with \n separators etc.
import bs4 as bs
import urllib.request
import pandas as pd
import csv

max_page_num = 14
max_page_dig = 1  # number of digits in the page number

with open('result.csv', "w", newline='') as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc
    print(page_num)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    print(source)
    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    #table = soup.table
    table_rows = table.find_all('tr')
    with open('result.csv', 'a', newline='') as f:
        for tr in table_rows:
            td = tr.find_all('td')
            row = [i.text for i in td]
            f.write(str(row))
In the first version the data is all placed on one line and not separated. The second version puts each page into one cell and splits the pages in half.
Upvotes: 0
Views: 109
Reputation: 143187
A page may have many <table> elements in its HTML (tables are sometimes used to create menus or to organize elements on the page). pandas.read_html() creates a DataFrame for every <table> on the page and always returns a list with all of the created DataFrames (even if there was only one <table>), so you have to check which one has your data. You can display every DataFrame from the list to see which one you need. This way I know that the first DataFrame has your data, so you have to use [0] to get it.
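To see what that means in practice, here is a small self-contained sketch. The HTML string below is only a stand-in for the real page (which may have extra layout/menu tables); pd.read_html() still returns one DataFrame per <table>, and you inspect each one to find the player data:

```python
import io
import pandas as pd

# Hypothetical HTML standing in for the real page: a layout table
# followed by the table that actually holds the data.
html = """
<table><tr><td>Home</td><td>About</td></tr></table>
<table>
  <tr><th>Name</th><th>State</th></tr>
  <tr><td>John Doe</td><td>CA</td></tr>
</table>
"""

all_tables = pd.read_html(io.StringIO(html))  # always a list
print('tables found:', len(all_tables))

# Inspect each DataFrame to decide which index holds your data.
for i, df in enumerate(all_tables):
    print(i, df.shape)
```

On the real page the same loop tells you that index [0] is the one to keep.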
import pandas as pd

max_page_num = 15  # it has to be 15 instead of 14 because `range(15)` will give `0-14`

with open('result.csv', 'w', newline='') as f:
    f.write('Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment\n')

for i in range(max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    all_tables = pd.read_html(source)
    df = all_tables[0]
    print('items:', len(df))
    df.to_csv('result.csv', header=False, index=False, mode='a')  # 'a' appends each table to the csv file instead of overwriting it (note: same filename as the header above)
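Once the file exists, it can be loaded back into pandas for the sorting and analysis mentioned in the question. A minimal sketch, using a hypothetical in-memory sample in place of the real result.csv:

```python
import io
import pandas as pd

# Hypothetical sample standing in for the result.csv written by the loop.
csv_data = io.StringIO(
    "Name,Gender,State,Position,Grad,Club/HS,Rating,Commitment\n"
    "Jane Roe,F,TX,Forward,2020,Example FC,4,State U\n"
    "John Doe,M,CA,Keeper,2020,Sample SC,5,Tech U\n"
)

df = pd.read_csv(csv_data)
# Sort best-rated players first, then alphabetically by name.
df = df.sort_values(['Rating', 'Name'], ascending=[False, True])
print(df)
```

With the real file you would call pd.read_csv('result.csv') instead of the StringIO sample.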
EDIT:

In the second version you should use strip() to remove the \n, which csv would treat as the beginning of a new row. You shouldn't use str(row) because it creates a string with [ ], which is not correct in a csv file. You should rather use ",".join(row) to create the string. And you have to add \n at the end of every row because write() doesn't add it.

But it would be better to use the csv module and its writerow() for this. It will convert the list to a string with , as separator and add \n automatically. If some item has a , or \n in it, then it will put it in " " to create a correct row.
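A small sketch of that difference, using a made-up row whose last field contains a comma:

```python
import csv
import io

row = ["John Doe", "CA", "Forward", "State U, Main Campus"]  # last field has a comma

# Wrong: str(row) keeps the Python list syntax with [ ] and quotes.
print(str(row))

# Right: csv.writer adds the , separators, quotes the field containing
# the comma, and terminates the row with a line ending automatically.
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())

# The quoted row parses back into the same four fields.
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
print(parsed == row)  # True
```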
import bs4 as bs
import urllib.request
import csv

max_page_num = 15

fh = open('result.csv', "w", newline='')
csv_writer = csv.writer(fh)
csv_writer.writerow(["Name", "Gender", "State", "Position", "Grad", "Club/HS", "Rating", "Commitment"])

for i in range(max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        #row = [i.text.strip() for i in td]  # strip to remove spaces and '\n'
        row = [i.get_text(strip=True) for i in td]  # strip to remove spaces and '\n'
        if row:  # check if row is not empty
            #print(row)
            csv_writer.writerow(row)

fh.close()
Upvotes: 1