hoops9682

Reputation: 35

BeautifulSoup scraping issue

My code below (almost) manages to scrape each player's data into rows, with column values separated by commas. However, the player names seem to have child elements whose text is also displayed in separate rows. I simply want the text of the name, not the links. Also, some records are repeated in my output. Any help would be greatly appreciated! I am using BS4 and Python 3.5. Here is my code:

import urllib
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    page = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(page, "html.parser")
    return soupdata

currentdata = ""
soup = make_soup("http://www.foxsports.com/soccer/stats?competition=1&season=20160&category=STANDARD&pos=0&team=0&isOpp=0&sort=3&sortOrder=0&page=0")
for record in soup.findAll('tr'):
    playerdata = ""
    for data in record.findAll('td'):
        playerdata = playerdata + "," + data.text
        currentdata = currentdata + "\n" + playerdata

        print(currentdata)

Upvotes: 0

Views: 107

Answers (1)

宏杰李

Reputation: 12158

import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # Download the page and parse it with Python's built-in HTML parser.
    page = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(page, "html.parser")
    return soupdata

soup = make_soup("http://www.foxsports.com/soccer/stats?competition=1&season=20160&category=STANDARD&pos=0&team=0&isOpp=0&sort=3&sortOrder=0&page=0")
# class_=False matches only <tr> tags that have no class attribute.
for record in soup.find_all('tr', class_=False):
    # Collect each cell's text in a list; ',' separates the text fragments inside a cell.
    row = [data.get_text(',', strip=True) for data in record.find_all('td')]
    print(' '.join(row))

Output:

1,Sánchez, Alexis,Sánchez, A.,ARS 21 20 1786 14 7 30 72 3 0
1,Costa, Diego,Costa, D.,CHE 19 19 1681 14 5 26 57 5 0
1,Ibrahimovic, Zlatan,Ibrahimovic, Z.,MUN 20 20 1800 14 3 36 89 5 0
4,Kane, Harry,Kane, H.,TOT 16 16 1360 13 2 27 53 0 0
5,Lukaku, Romelu,Lukaku, R.,EVE 20 19 1737 12 4 28 55 3 0
5,Defoe, Jermain,Defoe, J.,SUN 21 21 1882 12 2 18 57 1 0
  1. Collect the cell text in a list, then join the items together; do not build the row by string concatenation.
  2. To exclude the tr rows you do not want, use class_=False; this selects only tr tags that have no class attribute.
  3. get_text() accepts a separator string, which is placed between the text fragments it joins.
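
If you only want the full player name and comma-separated output, the three points above could be combined roughly as in the sketch below. It runs against a small inline HTML sample rather than the live page. The markup in SAMPLE_HTML is hypothetical and only mimics the fragment order visible in the output above (rank, full name, short name, team, all in the first cell). The names SAMPLE_HTML, extract_rows and rows_to_csv are made up for illustration. Because the names themselves contain commas, the sketch uses the standard csv module, which quotes such fields.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup, shaped only to mimic the fragments seen in the output above.
SAMPLE_HTML = """
<table>
  <tr class="header"><th>Player</th><th>GP</th><th>GS</th></tr>
  <tr>
    <td><span>1</span><a href="#"><span>Sánchez, Alexis</span><span>Sánchez, A.</span></a><span>ARS</span></td>
    <td>21</td><td>20</td>
  </tr>
  <tr>
    <td><span>4</span><a href="#"><span>Kane, Harry</span><span>Kane, H.</span></a><span>TOT</span></td>
    <td>16</td><td>16</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one list per player row: [rank, full name, team, stat, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # class_=False keeps only <tr> tags without a class attribute.
    for record in soup.find_all("tr", class_=False):
        cells = record.find_all("td")
        if not cells:
            continue  # skip rows that have no data cells
        # Assumed fragment order in the first cell: rank, full name, short name, team.
        fragments = list(cells[0].stripped_strings)
        if len(fragments) < 4:
            continue  # the cell does not have the assumed shape
        rank, full_name, _short_name, team = fragments[:4]
        stats = [cell.get_text(strip=True) for cell in cells[1:]]
        rows.append([rank, full_name, team] + stats)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text; fields containing commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows), end="")
```

On the real page, the position of the name among the text fragments may differ, so check the index against the actual markup before relying on it.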

Upvotes: 1
