Reputation: 300
I am trying to parse 1st table located here using BeautifulSoup in Python. It parsed my First column but for some reason It didn't parsed entire table. Any help is appreciated!
Note: I am trying to parse entire table and convert into pandas dataframe
My Code:
import requests
from bs4 import BeautifulSoup
WIKI_URL = requests.get("https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records").text
soup = BeautifulSoup(WIKI_URL, features="lxml")
print(soup.prettify())
my_table = soup.find('table',{'class':'wikitable sortable'})
links=my_table.findAll('a')
print(links)
Upvotes: 2
Views: 170
Reputation: 28565
NOTE: Accept B.Adler's solution as it is good work and sound advice. This solution is simply so you can see some alternatives as you are learning.
Whenever I see <table>
tags, I'll usually check out pandas first to see if I can find what I need from the tables that way. pd.read_html()
will return a list of dataframes, and you can work/manipulate those to extract what you need.
import pandas as pd
WIKI_URL = "https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records"
tables = pd.read_html(WIKI_URL)
You can also look through the dataframes to see which has the data you want. I just used dataframe in index position 2 for this one, which is the first table you were looking for
table = tables[2]
Output:
print (table)
0 1 ... 6 7
0 Team Won ... Total Games Conference
1 Michigan 953 ... 1331 Big Ten
2 Ohio State 1 911 ... 1289 Big Ten
3 Notre Dame 2 897 ... 1263 Independent
4 Boise State 448 ... 618 Mountain West
5 Alabama 3 905 ... 1277 SEC
6 Oklahoma 896 ... 1274 Big 12
7 Texas 908 ... 1311 Big 12
8 USC 4 839 ... 1239 Pac-12
9 Nebraska 897 ... 1325 Big Ten
10 Penn State 887 ... 1319 Big Ten
11 Tennessee 838 ... 1281 SEC
12 Florida State 5 544 ... 818 ACC
13 Georgia 819 ... 1296 SEC
14 LSU 797 ... 1259 SEC
15 Appalachian State 617 ... 981 Sun Belt
16 Georgia Southern 387 ... 616 Sun Belt
17 Miami (FL) 630 ... 1009 ACC
18 Auburn 759 ... 1242 SEC
19 Florida 724 ... 1182 SEC
20 Old Dominion 76 ... 121 C-USA
21 Coastal Carolina 112 ... 180 Sun Belt
22 Washington 735 ... 1234 Pac-12
23 Clemson 744 ... 1248 ACC
24 Virginia Tech 743 ... 1262 ACC
25 Arizona State 614 ... 1032 Pac-12
26 Texas A&M 741 ... 1270 SEC
27 Michigan State 701 ... 1204 Big Ten
28 West Virginia 750 ... 1292 Big 12
29 Miami (OH) 690 ... 1195 MAC
.. ... ... ... ... ...
101 Memphis 482 ... 1026 The American
102 Kansas 582 ... 1271 Big 12
103 Wyoming 526 ... 1122 Mountain West
104 Louisiana 510 ... 1098 Sun Belt
105 Colorado State 520 ... 1124 Mountain West
106 Connecticut 508 ... 1107 The American
107 SMU 489 ... 1083 The American
108 Oregon State 530 ... 1173 Pac-12
109 UTSA 38 ... 82 C-USA
110 Kansas State 526 ... 1207 Big 12
111 New Mexico 483 ... 1103 Mountain West
112 Temple 468 ... 1094 The American
113 Iowa State 524 ... 1214 Big 12
114 Tulane 520 ... 1197 The American
115 Northwestern 535 ... 1240 Big Ten
116 UAB 126 ... 284 C-USA
117 Rice 470 ... 1108 C-USA
118 Eastern Michigan 453 ... 1089 MAC
119 Louisiana-Monroe 304 ... 727 Sun Belt
120 Florida Atlantic 87 ... 205 C-USA
121 Indiana 479 ... 1195 Big Ten
122 Buffalo 370 ... 922 MAC
123 Wake Forest 450 ... 1136 ACC
124 New Mexico State 430 ... 1090 Independent
125 UTEP 390 ... 1005 C-USA
126 UNLV11 228 ... 574 Mountain West
127 Kent State 341 ... 922 MAC
128 FIU 64 ... 191 C-USA
129 Charlotte 20 ... 65 C-USA
130 Georgia State 27 ... 94 Sun Belt
[131 rows x 8 columns]
Upvotes: 2
Reputation: 1539
It only parsed one column because you did a findall for only the items in the first column. To parse the entire table you'd have to do a findall for the table rows <tr>
and then a findall within each row for the table divides <td>
. Right now you are just doing a findall for the links and then printing the links.
my_table = soup.find('table',{'class':'wikitable sortable'})
for row in mytable.findAll('tr'):
print(','.join([td.get_text(strip=True) for td in row.findAll('td')]))
Upvotes: 5