Mark von Oven

Reputation: 71

Using Beautifulsoup to parse a big comment?

I'm using BS4 to parse this webpage (the URL is in the code below). You'll notice there are two separate tables on the page. Here's the relevant snippet of my code, which successfully returns the data I want from the first table but finds nothing in the second table:

# import packages
import urllib3
import certifi
from bs4 import BeautifulSoup
import pandas as pd

#settings
http = urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',
        ca_certs=certifi.where())
gamelog_offense = []

#scrape the data and write the .csv files
url = "https://www.sports-reference.com/cfb/schools/florida/2018/gamelog/"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="html.parser")
cnt = 0

for row in soup.find_all('tr'):
    try:
        col = row.find_all('td')
        Pass_cmp = col[4].get_text()
        Pass_att = col[5].get_text()
        gamelog_offense.append([Pass_cmp, Pass_att])
        cnt += 1
    except IndexError:
        # rows without enough <td> cells (e.g. header rows) are skipped
        pass
print("Finished writing with " + str(cnt) + " records")
Finished writing with 13 records

I've verified that the data from the SECOND table is contained within the soup (I can see it!). After lots of troubleshooting, I've discovered that the entire second table sits inside one big comment (why?). I've managed to extract this comment into a single Comment object using the code below, but I can't figure out what to do with it after that to extract the data I want. Ideally, I'd like to parse the comment in the same way I'm successfully parsing the first table. I've tried the ideas from similar Stack Overflow questions (selenium, phantomjs)... no luck.

import bs4
defense = soup.find(id="all_defense")
for item in defense.children:
    if isinstance(item, bs4.element.Comment):
        big_comment = item
print(big_comment)
<div class="table_outer_container">
  <div class="overthrow table_container" id="div_defense">
   ...and so on....

Upvotes: 0

Views: 295

Answers (1)

Mark von Oven

Reputation: 71

Posting an answer here in case others find it helpful. Many thanks to @TomasCarvalho for pointing me toward a solution. I was able to pass the big comment as HTML into a second soup instance using the following code, and then just use the original parsing code on the new soup instance. (Note: the try/except is there because some of the teams have no gamelog, and you can't call .children on a NoneType.)

import bs4

html = ''  # default to an empty document so html is always defined
try:
    defense = soup.find(id="all_defense")
    for item in defense.children:
        if isinstance(item, bs4.element.Comment):
            html = item
except AttributeError:
    # soup.find() returned None: this team has no gamelog
    pass
Dsoup = BeautifulSoup(html, features="html.parser")
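
For anyone who wants to try this without hitting the site, here is a minimal, self-contained sketch of the same idea: pull the comment out of the wrapper div, feed it to a second soup, and run the original row-parsing loop on that. The sample HTML and its numbers are made up for illustration, and the helper names (parse_commented_table, extract_passing) are my own, not anything from BS4 or the site. It also assumes the defense table uses the same column positions as the offense table, which I have not checked against every page.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table is wrapped in an HTML comment.
SAMPLE_HTML = """
<div id="all_defense">
<!--
<div class="table_outer_container">
  <table id="defense">
    <tr><th>Date</th><th>Loc</th><th>Opp</th><th>Res</th><th>Cmp</th><th>Att</th></tr>
    <tr><td>2018-09-01</td><td></td><td>Opp A</td><td>W</td><td>15</td><td>30</td></tr>
    <tr><td>2018-09-08</td><td>@</td><td>Opp B</td><td>L</td><td>20</td><td>35</td></tr>
  </table>
</div>
-->
</div>
"""


def parse_commented_table(soup, wrapper_id):
    """Return a new soup built from the comment(s) inside the element with wrapper_id."""
    wrapper = soup.find(id=wrapper_id)
    if wrapper is None:
        # No such element (e.g. a team with no gamelog): return an empty soup.
        return BeautifulSoup("", features="html.parser")
    comments = wrapper.find_all(string=lambda text: isinstance(text, Comment))
    return BeautifulSoup("".join(comments), features="html.parser")


def extract_passing(soup):
    """Collect [completions, attempts] from every row with at least six <td> cells."""
    rows = []
    for row in soup.find_all("tr"):
        cols = row.find_all("td")
        if len(cols) > 5:
            rows.append([cols[4].get_text(), cols[5].get_text()])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")
defense_soup = parse_commented_table(soup, "all_defense")
gamelog_defense = extract_passing(defense_soup)
print(gamelog_defense)
```

Running extract_passing on the original soup finds nothing, because the parser treats the commented-out table as a single string rather than as tags. The explicit None check does the same job as the try/except above without hiding unrelated errors.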

Upvotes: 1
