Able Archer

Reputation: 569

How to Parse NHL Team Defense stats to create Pandas DataFrame using Python?

I have scraped data but need help parsing it correctly. I am still learning and will appreciate any advice I can get.

I am looking for the data for the following two variables: TEAM, SA/G

Here is my code so far:


#import modules
from selenium import webdriver

from bs4 import BeautifulSoup

#set path for driver
driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')

# open page
driver.get('http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals')

# driver.page_source
soup = BeautifulSoup(driver.page_source,'lxml')

#close driver
driver.close()

#grab table data
table = soup.find(class_='tablehead')

#parse data (extra data included)
for t in table:
    td_tags = table.find_all('td')
    # print(td_tags)
    for td in td_tags:
        a_tags = table.find('a')
        print(td.text)

The scrape returns the right table, but the output includes every cell rather than just the columns I need. How can I extract only the TEAM and SA/G data?

Here is an example of the Pandas DataFrame output I am looking for:

Team             SA/G
Nashville        30.1
Colorado         33.6
Washington       31.0

Thanks in advance for any help that you may offer!

CODE UPDATE:

The 1st attempt grabbed only the Team info and had extra data ("GP", for example).

1st attempt at fixing code:

# parse data (closer to desired output but missing SA/G data)
for tab in table:
    tr = table.find_all('tr')
    for t in tr:
        td = table.find_all('td')
        print(t.a.text)

The 2nd attempt grabbed both the Team and SA/G data, but it also included extra rows (for example, the header text "TEAM" and "SA/G" is repeated every 11th row).

Here is the 2nd attempt:

#parses TEAM and SA/G
import pandas as pd
x = pd.read_html("http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals")[0]

print(x[[1, 9]])

Upvotes: 1

Views: 114

Answers (1)

mmngreco

Reputation: 556

If you want to read a table from a URL, I would use the read_html function from pandas. By default, pandas tries to parse the page with lxml and falls back to bs4 with html5lib if that fails, so you do not have to walk the tags yourself. You can see an example of this below:

In [3]: import pandas as pd 
In [4]: pd.read_html("http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals")[0]
Out[4]:
     0             1   2   3   4     5     6      7     8     9      10     11   12    13    14
0    RK          TEAM  GP   G  GA  GF/G  GA/G   DIFF  SF/G  SA/G   DIFF  SVPCT  PIM  PIMA  DIFF
1     1     Nashville  11  45  33  4.09  3.00   1.09  31.9  30.1   01.8   .900   87   109   -22
2     2      Colorado  11  44  30  4.00  2.73   1.27  31.4  33.6  -02.3   .919  102   140   -38
3     3    Washington  13  49  43  3.77  3.31   0.46  30.3  31.0  -00.7   .893  125   111    14
4     4     Vancouver  11  40  26  3.64  2.36   1.27  32.6  31.3   01.4   .924  103   119   -16
5   NaN      Montreal  11  40  35  3.64  3.18   0.45  34.4  31.1   03.3   .898   77    83    -6
6     6       Toronto  13  46  44  3.54  3.38   0.15  32.7  32.8  -00.1   .897   88    82     6
7     7       Florida  12  42  45  3.50  3.75  -0.25  34.0  30.0   04.0   .875   78    86    -8
8   NaN  Philadelphia  10  35  30  3.50  3.00   0.50  35.4  27.4   08.0   .891   78    90   -12
9     9       Buffalo  13  43  32  3.31  2.46   0.85  30.2  33.5  -03.2   .926  100   118   -18
10   10     Tampa Bay  10  33  32  3.30  3.20   0.10  31.4  34.5  -03.1   .907  100    88    12
11   RK          TEAM  GP   G  GA  GF/G  GA/G   DIFF  SF/G  SA/G   DIFF  SVPCT  PIM  PIMA  DIFF
12   11        Boston  11  36  23  3.27  2.09   1.18  33.3  31.5   01.7   .934   82    80     2
13  NaN      Carolina  11  36  29  3.27  2.64   0.64  32.9  29.4   03.5   .910   97    87    10
14   13    Pittsburgh  12  39  30  3.25  2.50   0.75  31.9  29.8   02.1   .916   82    84    -2
15   14    NY Rangers   9  29  34  3.22  3.78  -0.56  28.2  36.9  -08.7   .898   90    82     8
16   15     St. Louis  12  37  38  3.08  3.17  -0.08  29.0  30.3  -01.3   .895   87    91    -4
17   16         Vegas  13  40  36  3.08  2.77   0.31  35.3  32.7   02.6   .915  143   143     0
18   17      Edmonton  12  36  32  3.00  2.67   0.33  27.9  30.6  -02.7   .913   80    74     6
19  NaN       Arizona  11  33  24  3.00  2.18   0.82  31.5  29.8   01.6   .927   68    74    -6
20  NaN  NY Islanders  11  33  27  3.00  2.45   0.55  27.6  31.5  -03.8   .922   95    67    28
21   20      Columbus  11  30  39  2.73  3.55  -0.82  33.6  31.1   02.5   .886   75    81    -6
22   RK          TEAM  GP   G  GA  GF/G  GA/G   DIFF  SF/G  SA/G   DIFF  SVPCT  PIM  PIMA  DIFF
23   21        Ottawa  11  29  36  2.64  3.27  -0.64  31.1  35.0  -03.9   .906  134   110    24
24   22       Calgary  13  34  39  2.62  3.00  -0.38  30.9  31.2  -00.3   .904  147   122    25
25   23      San Jose  12  31  43  2.58  3.58  -1.00  28.3  31.8  -03.4   .887  128   124     4
26  NaN   Los Angeles  12  31  49  2.58  4.08  -1.50  37.3  28.3   08.9   .856  102   116   -14
27   25      Winnipeg  12  30  37  2.50  3.08  -0.58  33.2  33.3  -00.1   .907   52    88   -36
28  NaN       Chicago  10  25  30  2.50  3.00  -0.50  31.6  32.9  -01.3   .909   66    68    -2
29   27       Anaheim  13  32  31  2.46  2.38   0.08  27.5  31.5  -04.0   .924  131    99    32
30   28    New Jersey   9  22  34  2.44  3.78  -1.33  29.3  29.0   00.3   .870   99    93     6
31   29     Minnesota  11  26  37  2.36  3.36  -1.00  29.5  30.4  -00.8   .889   87    93    -6
32   30       Detroit  12  27  45  2.25  3.75  -1.50  31.5  33.2  -01.7   .887  105    96     9
33   RK          TEAM  GP   G  GA  GF/G  GA/G   DIFF  SF/G  SA/G   DIFF  SVPCT  PIM  PIMA  DIFF
34   31        Dallas  13  25  35  1.92  2.69  -0.77  27.8  28.8  -01.1   .907   89    79    10
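
To get from that raw table to just the two columns you asked for, something like the following might work. It is only a sketch: `tidy_stats` and its parameters are names I made up for illustration, and the small `raw` frame is a hand-copied stand-in for `pd.read_html(...)[0]` so the example runs without hitting the site. It keeps two columns by label, drops the rows where the header text is repeated, and converts SA/G to a float.

```python
import pandas as pd


def tidy_stats(raw, name_col, stat_col, header_text="TEAM"):
    """Keep two columns of a headerless table and drop repeated header rows.

    Hypothetical helper for illustration; name_col and stat_col are the
    column labels of the team name and the statistic in the raw table.
    """
    # Rows whose name column holds the header text are header repeats
    is_data_row = raw[name_col] != header_text
    out = raw.loc[is_data_row, [name_col, stat_col]].copy()
    out.columns = ["Team", "SA/G"]
    # The scraped values are strings, so convert the statistic to a number
    out["SA/G"] = out["SA/G"].astype(float)
    return out.reset_index(drop=True)


# Stand-in for pd.read_html(url)[0]: a few cells copied from the output above,
# with the header row repeated in the middle as on the real page.
raw = pd.DataFrame([
    ["RK", "TEAM", "GP", "SA/G"],
    ["1", "Nashville", "11", "30.1"],
    ["2", "Colorado", "11", "33.6"],
    ["RK", "TEAM", "GP", "SA/G"],
    ["3", "Washington", "13", "31.0"],
])

result = tidy_stats(raw, name_col=1, stat_col=3)
print(result)
```

With the full table shown above, the team name is in column 1 and SA/G is in column 9, so the call would presumably be `tidy_stats(pd.read_html(url)[0], name_col=1, stat_col=9)`. That assumes the page still has the same layout and that pandas returns integer column labels as it did in my output.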

Upvotes: 1
