biggboss2019

Reputation: 300

BeautifulSoup parsed only one Column instead of entire Wikipedia table in Python

I am trying to parse the first table located here using BeautifulSoup in Python. It parsed the first column, but for some reason it didn't parse the entire table. Any help is appreciated!

Note: I am trying to parse the entire table and convert it into a pandas DataFrame.

My Code:

import requests
from bs4 import BeautifulSoup

WIKI_URL = requests.get("https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records").text
soup = BeautifulSoup(WIKI_URL, features="lxml")
print(soup.prettify())
my_table = soup.find('table',{'class':'wikitable sortable'})
links=my_table.findAll('a')
print(links)

Upvotes: 2

Views: 170

Answers (2)

chitown88

Reputation: 28565

NOTE: Accept B.Adler's solution as it is good work and sound advice. This solution is simply so you can see some alternatives as you are learning.

Whenever I see <table> tags, I usually check pandas first to see whether I can get what I need from the tables that way. pd.read_html() returns a list of DataFrames, which you can then manipulate to extract what you need.

import pandas as pd

WIKI_URL = "https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records"

tables = pd.read_html(WIKI_URL)

You can also look through the DataFrames to see which one has the data you want. I used the DataFrame at index position 2 here, which is the first table you were looking for.

table = tables[2]

Output:

print(table)
                     0    1      ...                  6              7
0                 Team  Won      ...        Total Games     Conference
1             Michigan  953      ...               1331        Big Ten
2         Ohio State 1  911      ...               1289        Big Ten
3         Notre Dame 2  897      ...               1263    Independent
4          Boise State  448      ...                618  Mountain West
5            Alabama 3  905      ...               1277            SEC
6             Oklahoma  896      ...               1274         Big 12
7                Texas  908      ...               1311         Big 12
8                USC 4  839      ...               1239         Pac-12
9             Nebraska  897      ...               1325        Big Ten
10          Penn State  887      ...               1319        Big Ten
11           Tennessee  838      ...               1281            SEC
12     Florida State 5  544      ...                818            ACC
13             Georgia  819      ...               1296            SEC
14                 LSU  797      ...               1259            SEC
15   Appalachian State  617      ...                981       Sun Belt
16    Georgia Southern  387      ...                616       Sun Belt
17          Miami (FL)  630      ...               1009            ACC
18              Auburn  759      ...               1242            SEC
19             Florida  724      ...               1182            SEC
20        Old Dominion   76      ...                121          C-USA
21    Coastal Carolina  112      ...                180       Sun Belt
22          Washington  735      ...               1234         Pac-12
23             Clemson  744      ...               1248            ACC
24       Virginia Tech  743      ...               1262            ACC
25       Arizona State  614      ...               1032         Pac-12
26           Texas A&M  741      ...               1270            SEC
27      Michigan State  701      ...               1204        Big Ten
28       West Virginia  750      ...               1292         Big 12
29          Miami (OH)  690      ...               1195            MAC
..                 ...  ...      ...                ...            ...
101            Memphis  482      ...               1026   The American
102             Kansas  582      ...               1271         Big 12
103            Wyoming  526      ...               1122  Mountain West
104          Louisiana  510      ...               1098       Sun Belt
105     Colorado State  520      ...               1124  Mountain West
106        Connecticut  508      ...               1107   The American
107                SMU  489      ...               1083   The American
108       Oregon State  530      ...               1173         Pac-12
109               UTSA   38      ...                 82          C-USA
110       Kansas State  526      ...               1207         Big 12
111         New Mexico  483      ...               1103  Mountain West
112             Temple  468      ...               1094   The American
113         Iowa State  524      ...               1214         Big 12
114             Tulane  520      ...               1197   The American
115       Northwestern  535      ...               1240        Big Ten
116                UAB  126      ...                284          C-USA
117               Rice  470      ...               1108          C-USA
118   Eastern Michigan  453      ...               1089            MAC
119   Louisiana-Monroe  304      ...                727       Sun Belt
120   Florida Atlantic   87      ...                205          C-USA
121            Indiana  479      ...               1195        Big Ten
122            Buffalo  370      ...                922            MAC
123        Wake Forest  450      ...               1136            ACC
124   New Mexico State  430      ...               1090    Independent
125               UTEP  390      ...               1005          C-USA
126             UNLV11  228      ...                574  Mountain West
127         Kent State  341      ...                922            MAC
128                FIU   64      ...                191          C-USA
129          Charlotte   20      ...                 65          C-USA
130      Georgia State   27      ...                 94       Sun Belt

[131 rows x 8 columns]
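In the output above, the real header ("Team", "Won", ...) sits in row 0 and some team names carry trailing footnote digits ("Ohio State 1", "UNLV11"). One possible cleanup is sketched below. The small `raw` frame is a hand-made stand-in for `tables[2]`, and `promote_header` is a hypothetical helper name, not part of pandas. If your pandas version already picks up the header row, the promotion step is unnecessary.

```python
import pandas as pd

# Stand-in for tables[2]: integer column labels, header text in row 0.
# Only three of the eight columns are reproduced here for brevity.
raw = pd.DataFrame([
    ["Team", "Won", "Conference"],
    ["Michigan", "953", "Big Ten"],
    ["Ohio State 1", "911", "Big Ten"],
    ["UNLV11", "228", "Mountain West"],
])


def promote_header(df):
    """Use the first row as column names and drop it from the data."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out


table = promote_header(raw)

# Assumption: any digits at the very end of a team name are footnote markers.
table["Team"] = table["Team"].str.replace(r"\s*\d+$", "", regex=True)

# The counts arrive as strings in this stand-in, so convert them to integers.
table["Won"] = table["Won"].astype(int)

print(table)
```

The regular expression would also strip digits from a team whose name legitimately ends in a number, so check the result against the page before relying on it.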

Upvotes: 2

B.Adler

Reputation: 1539

It parsed only one column because your findAll targeted only the links (<a> tags), i.e. the items in the first column. To parse the entire table, call findAll for the table rows (<tr>) and then, within each row, call findAll for the table data cells (<td>). Right now you are just collecting the links and printing them.

my_table = soup.find('table', {'class': 'wikitable sortable'})
# Walk every row, then join the text of each <td> cell in that row.
for row in my_table.findAll('tr'):
    print(','.join([td.get_text(strip=True) for td in row.findAll('td')]))
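Since the question asks for a pandas DataFrame, the row-by-row loop can be taken one step further. The following is a minimal sketch, assuming the header cells are <th> elements in the first row. The `HTML` string is made-up sample markup standing in for the Wikipedia page, and `table_to_dataframe` is a hypothetical helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the downloaded page.
HTML = """
<table class="wikitable sortable">
  <tr><th>Team</th><th>Won</th><th>Conference</th></tr>
  <tr><td><a href="/wiki/Michigan">Michigan</a></td><td>953</td><td>Big Ten</td></tr>
  <tr><td><a href="/wiki/Texas">Texas</a></td><td>908</td><td>Big 12</td></tr>
</table>
"""


def table_to_dataframe(html):
    """Collect every row of the first wikitable into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="wikitable" matches any table whose class list includes "wikitable".
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        # Take header (<th>) and data (<td>) cells so the header row is kept.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    # Assumption: the first collected row holds the column names.
    return pd.DataFrame(rows[1:], columns=rows[0])


df = table_to_dataframe(HTML)
print(df)
```

With the real page, pass the text returned by requests.get(...) instead of `HTML`. All values come back as strings, so numeric columns still need converting.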

Upvotes: 5
