Reputation: 894

Removing new line characters in web scrape

I'm trying to scrape baseball lineup data and return only the player names. Right now each entry comes back as position, newline character, name, newline character, and then batting side. For example, I want

'D. Fletcher'

but instead I get

'LF\nD. Fletcher\nR'

Additionally, it is giving me all players on the page. I would prefer to group them by team, which may require a dictionary of some sort, but I am not sure what that code would look like.

I've tried using the strip function, but I believe that only removes leading and trailing whitespace, not characters in the middle. I've also tried researching how to get just the title attribute from the anchor tag but have not figured out how to do that.

from bs4 import BeautifulSoup
import requests


url = 'https://www.rotowire.com/baseball/daily_lineups.htm'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

players = soup.find_all('li', {'class': 'lineup__player'})

####for link in players.find('a'):
#####   print (link.string)

awayPlayers = [player.text.strip() for player in players]
print(awayPlayers)

Upvotes: 2

Answers (4)

QHarr

Reputation: 84465

Say you wanted to build that dict with team names and players; you could do something like the following. I don't know if you want the highlighted players, e.g. Trevor Bauer, so I have added variables to hold them in case they are needed.

Ad boxes and tools boxes are excluded via the :not pseudo-class, which is passed a list of classes to ignore.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.rotowire.com/baseball/daily-lineups.php')
soup = bs(r.content, 'lxml')

# flat list of abbreviations, visitor then home for each game (26 entries)
teams = [item.text for item in soup.select('.lineup__abbr')]

matches = {}
i = 0
# skip ad and tools boxes; every remaining div is one game
for teambox in soup.select('.lineups > div:not(.is-ad, .is-tools)'):
    team_visit = teams[i]
    team_home = teams[i + 1]
    # highlighted players, kept in case they are needed
    highlights = teambox.select('.lineup__player-highlight-name a')
    visit_highlight = highlights[0].text
    home_highlight = highlights[1].text
    match = team_visit + ' v ' + team_home
    # the title attribute on each player link holds the name
    visitors = [item['title'] for item in teambox.select('.is-visit .lineup__player [title]')]
    home = [item['title'] for item in teambox.select('.is-home .lineup__player [title]')]
    matches[match] = {'visitor' : [{team_visit : visitors}] ,
                      'home' : [{team_home : home}]
                     }
    i += 2  # each game uses two abbreviations
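If a flat team-to-players dict is all that is needed, the loop above can be reduced to something like the sketch below. It runs against a small inline HTML sample rather than the live page. The sample markup, the abbreviations AAA and BBB, the player names, and the helper name lineups_by_team are all made up for illustration, and it assumes both .lineup__abbr elements sit inside each game box, which may not match the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names used above;
# the structure of the real page may differ.
SAMPLE_HTML = """
<div class="lineups">
  <div class="lineup is-ad">Advertisement</div>
  <div class="lineup">
    <div class="lineup__abbr">AAA</div>
    <div class="lineup__abbr">BBB</div>
    <ul class="lineup__list is-visit">
      <li class="lineup__player">LF <a title="Visiting Player One" href="#">V. One</a> R</li>
      <li class="lineup__player">CF <a title="Visiting Player Two" href="#">V. Two</a> L</li>
    </ul>
    <ul class="lineup__list is-home">
      <li class="lineup__player">SS <a title="Home Player One" href="#">H. One</a> R</li>
    </ul>
  </div>
  <div class="lineup is-tools">Tools</div>
</div>
"""


def lineups_by_team(html):
    """Return {team abbreviation: [player names]} for every game box."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # :not() with a selector list skips the ad and tools boxes
    for box in soup.select(".lineups > div:not(.is-ad, .is-tools)"):
        abbrs = [a.get_text(strip=True) for a in box.select(".lineup__abbr")]
        if len(abbrs) < 2:
            continue  # assumption: a box without two abbreviations is not a game
        visit, home = abbrs[0], abbrs[1]
        result[visit] = [a["title"] for a in box.select(".is-visit .lineup__player [title]")]
        result[home] = [a["title"] for a in box.select(".is-home .lineup__player [title]")]
    return result


teams_to_players = lineups_by_team(SAMPLE_HTML)
print(teams_to_players)
```

Reading the abbreviations from inside each box avoids keeping a separate index in step with the boxes, at the cost of relying on that assumed nesting.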

Upvotes: 1

Amit Nanaware

Reputation: 3346

You have to find the a tag and read its title attribute; see the code below.

awayPlayers = [player.find('a').get('title') for player in players]
print(awayPlayers)

Output is:

['Leonys Martin', 'Jose Ramirez', 'Jordan Luplow', 'Carlos Santana',
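To make the difference between the li text, the anchor text, and the title attribute concrete, here is a small sketch on a single hand-written list item. The markup is a guess at the page structure, and the position, the batting side, and the inner class names lineup__pos and lineup__bats are placeholders; only the two forms of the name come from the outputs shown in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical list item; position "CF", batting side "L" and the inner
# class names are placeholders, not taken from the real page.
SAMPLE_LI = """
<li class="lineup__player">
<div class="lineup__pos">CF</div>
<a title="Leonys Martin" href="#">L. Martin</a>
<span class="lineup__bats">L</span>
</li>
"""

soup = BeautifulSoup(SAMPLE_LI, "html.parser")
player = soup.find("li", {"class": "lineup__player"})
link = player.find("a")

whole_text = player.text.strip()  # text of the whole li, newlines included
short_name = link.text.strip()    # text inside the anchor only
full_name = link.get("title")     # value of the title attribute

print(repr(whole_text))
print(repr(short_name))
print(repr(full_name))
```

Note that link.get("title") returns None when the attribute is missing, whereas link["title"] raises a KeyError.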

Upvotes: -1

sconfluentus

Reputation: 4993

I think you were almost there; you just needed to tweak it a little bit:

awayPlayers = [player.find('a').text for player in players]

This list comprehension finds the anchor inside each list item and pulls its text, so you get a list of just the names:

['L. Martin',
 'Jose Ramirez',
 'J. Luplow'...]

Upvotes: 0

Selcuk

Reputation: 59228

You should only get the .text for the a tag, not the whole li:

awayPlayers = [player.find('a').text.strip() for player in players]

That would result in something like the following:

['L. Martin', 'Jose Ramirez', 'J. Luplow', 'C. Santana', ...
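For comparison, the name can also be recovered from the combined string in the question, which shows why strip alone is not enough: it only trims the ends, so the string has to be split on the newlines. This is only a sketch, and it assumes every entry has exactly three parts in the order position, name, batting side; the helper name name_from_text is made up. Selecting the a tag as above is more robust.

```python
def name_from_text(raw):
    """Return the middle part of 'position\\nname\\nside', or None if the shape differs."""
    # splitlines() breaks on the embedded newlines; strip() cleans each piece
    parts = [piece.strip() for piece in raw.splitlines() if piece.strip()]
    if len(parts) != 3:
        return None  # assumption violated: not position, name, batting side
    return parts[1]


raw = 'LF\nD. Fletcher\nR'
print(repr(raw.strip()))          # unchanged: the newlines are in the middle
print(repr(name_from_text(raw)))  # 'D. Fletcher'
```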

Upvotes: 2
