ADH

Reputation: 47

Webscraping Single Wiki Table Using BeautifulSoup On Page with Multiple Tables

Relatively new, hoping for some direction!

For a project, I am looking to scrape the data in one table from the following page into a dataframe:

https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States

There are two tables on this page. I am interested in the second one, "ZCTAs ranked by per capita income".

When reviewing the HTML for the page, I am unable to find anything that specifically identifies the table (or I am unsure what to look for), so I do not know what to pass to soup.find_all() to select it. The opening tag of the table reads:

<table class="toccolours sortable jquery-tablesorter" align="center" cellpadding="4" cellspacing="3" style="border: 1px solid #707070;">

Both tables on the page share the same class. The heading above the table I am trying to scrape has a distinct id, "ZCTAs_ranked_by_per_capita_income". Directly above the table is the following markup:

<h2><span class="mw-headline" id="ZCTAs_ranked_by_per_capita_income">ZCTAs ranked by per capita income</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States&amp;action=edit&amp;section=3" title="Edit section: ZCTAs ranked by per capita income">edit</a><span class="mw-editsection-bracket">]</span></span></h2>

Any help would be much appreciated - if more info is needed, please let me know!

Upvotes: 1

Views: 82

Answers (1)

Andrej Kesely

Reputation: 195543

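You can use pandas' read_html with the right index:

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States"

# read_html returns a list with one DataFrame per table it finds on the page;
# index 5 is the "ZCTAs ranked by per capita income" table
df = pd.read_html(url)[5]
print(df)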

Prints:

    Rank                       Designation   ZCTA  Population  Per CapitaIncome
0      1           Montchanin, Delaware[2]  19710          68            654485
1      2                    Houston, Texas  77010          76            283189
2      3             Rockland, Delaware[3]  19732          77            279424
3      4              Miami Beach, Florida  33109         467            236238
4      5                 Pineland, Florida  33945          79            162075
5      6                  Esopus, New York  12429          51            155540
6      7                 Henderson, Nevada  89012         175            148899
7      8              Atherton, California  94027        6857            114359
8      9              Boca Grande, Florida  33921        1500            107297
9     10        Deer Harbor, Washington[4]  98243         141            107173
10    11       Rancho Santa Fe, California  92067        7601            104487
11    12               Palm Beach, Florida  33480       11200            104294
12    13             Indianapolis, Indiana  46290         189            103347
13    14              Kenilworth, Illinois  60043        2617             99087
14    15         Beverly Hills, California  90210       21396             97198
15    16            Greenwich, Connecticut   6831       15167             97111
16    17           Los Angeles, California  90077       10465             96584
17    18        Portola Valley, California  94028        6595             96373
18    19                New York, New York  10022       30642             95196
19    20                Wyarno, Wyoming[5]  82845          49             94109
20    21           Short Hills, New Jersey   7078       12849             92940
21    22      Altamahaw, North Carolina[6]  27202          24             91666
22    23          Santa Monica, California  90402       11492             91147
23    24                New York, New York  10021      102078             91064
24    25            Gladwyne, Pennsylvania  19035        4050             90940
25    26                New York, New York  10069        1403             90113
26    27              Point Clear, Alabama  36564         107             89571
27    28             Boston, Massachusetts   2199        1005             88974
28    29         San Francisco, California  94105        2058             88829
29    30                 Glencoe, Illinois  60022        8490             88126
30    31  Belvedere-Tiburon, California[7]  94920       13048             86992
31    32                 Glencoe, Arkansas  72539         318             86724
32    33           Los Angeles, California  90067        2524             86319
33    34                  Atlanta, Georgia  30327       21003             85883
34    35                New York, New York  10028       44987             85866
35    36                    Houston, Texas  77046         471             85070
36    37         Lake McDonald, Montana[8]  59921           2             85000
37    38                New York, New York  10162        1726             84938
38    39            Mullett Lake, Michigan  49761          31             84692
39    40            Mc Afee, New Jersey[9]   7428         127             84595
40    41                New York, New York  10280        6614             83639
41    42             Yorklyn, Delaware[10]  19736          63             83524
42    43                 Chicago, Illinois  60611       26522             82930
43    44             Boston, Massachusetts   2110        1428             82736
44    45             Boston, Massachusetts   2109        3428             82689
45    46                New York, New York  10282        1574             82348
46    47             Far Hills, New Jersey   7931        2766             82227
47    48           New Canaan, Connecticut   6840       19402             81934
48    49                Medina, Washington  98039        3050             81926
49    50     Pacific Palisades, California  90272       22538             81609
50    51             Los Altos, California  94022       18466             81257
51    52         San Francisco, California  94123       22903             81044
52    53             Longboat Key, Florida  34228        7603             80963
53    54                 Davis, California  95618         643             80713
54    55                Alpine, New Jersey   7620        1649             80621
55    56                  Atlanta, Georgia  30326        1075             80161
56    57                New York, New York  10023       62206             79736
57    58                Winnetka, Illinois  60093       19528             79651
58    59             Weston, Massachusetts   2493       11469             79640
59    60              Bacova, Virginia[11]  24412          89             79439
60    61                  Springboro, Ohio  45066       17409             78786
61    62             Boston, Massachusetts   2108        3446             78771
62    63               Chappaqua, New York  10514       12004             78647
63    64               St. Louis, Missouri  63124        9819             78598
64    65   Ardsley-on-Hudson, New York[13]  10503         115             78591
65    66                New York, New York  10024       61414             77824
66    67           Essex Fells, New Jersey   7021        2151             77787
67    68                     Rye, New York  10580       16737             77721
68    69             Glenbrook, Nevada[14]  89413         365             77639
69    70               Darien, Connecticut   6820       19607             77519
70    71                  Captiva, Florida  33924         339             77458
71    72               Mill Neck, New York  11765         732             77420
72    73               Rex, North Carolina  28378          49             77306
73    74          Indian Wells, California  92210        3859             77302
74    75         Newport Coast, California  92657        5586             76870
75    76        Corona del Mar, California  92625       13407             76704
76    77              Wilmington, Delaware  19807        7345             76651
77    78                     Dallas, Texas  75225       20314             76203
78    79                 Chicago, Illinois  60601        5591             76157
79    80             Lake Forest, Illinois  60045       22248             75991
80    81           Los Angeles, California  90049       33520             75965
81    82               Vero Beach, Florida  32963       14077             75761
82    83                 Bedford, New York  10506        5537             75723
83    84         San Francisco, California  94111        3335             75344
84    85               Weston, Connecticut   6883       10037             74817
85    86          Paradise Valley, Arizona  85253       17560             74605
86    87             Pound Ridge, New York  10576        4530             74127
87    88             Westport, Connecticut   6880       25807             74064
88    89                  Washington, D.C.  20004         901             73803
89    90            Old Westbury, New York  11568        3992             72932
90    91                New York, New York  10128       59856             72691
91    92             Teterboro, New Jersey   7608          18             72613
92    93    Old Greenwich, Connecticut[15]   6870        7092             72317
93    94                     Austin, Texas  78730        4885             72110
94    95        Bloomfield Hills, Michigan  48302       16409             71985
95    96              Norwalk, Connecticut   6853        3466             71642
96    97                Rumson, New Jersey   7760        9665             71585
97    98           Corolla, North Carolina  27927         648             71301
98    99                 Gates Mills, Ohio  44040        2883             71016
99   100                 Chicago, Illinois  60606        1682             70878
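Two things are worth noting about the output above: the hard-coded index depends on how many tables the page happens to contain, and ZCTAs such as 06831 were parsed as integers, so their leading zeros are gone. The following is only a rough sketch of one way to handle both. It assumes the column names shown in the output, uses a hypothetical helper named pick_table, and works on small stand-in DataFrames instead of the live page so that it runs offline:

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names.

    Hypothetical helper, not part of pandas.
    """
    for table in tables:
        if set(required_columns).issubset(map(str, table.columns)):
            return table
    raise ValueError(f"no table with columns {required_columns}")


# Stand-ins for the list that pd.read_html(url) would return.
tables = [
    # placeholder for some unrelated table on the page
    pd.DataFrame({"Rank": [1], "Name": ["placeholder"]}),
    # two rows copied from the output above
    pd.DataFrame({
        "Rank": [16, 28],
        "Designation": ["Greenwich, Connecticut", "Boston, Massachusetts"],
        "ZCTA": [6831, 2199],
    }),
]

# Select by column names instead of by position in the list.
df = pick_table(tables, ["Designation", "ZCTA"])

# ZIP codes are identifiers, not numbers: convert to text and pad to 5 digits.
df["ZCTA"] = df["ZCTA"].astype(str).str.zfill(5)
print(df)
```

With the real page you would pass pd.read_html(url) as the first argument instead of the stand-in list.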

If you want to be more specific, you can use bs4 and CSS selectors:

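from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# the <table> that immediately follows the <h2> containing the heading id
tbl = soup.select_one("h2:has(#ZCTAs_ranked_by_per_capita_income) + table")

# wrap the markup in StringIO: newer pandas versions warn about literal HTML strings
df = pd.read_html(StringIO(str(tbl)))[0]
print(df)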
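The "+" in that selector only matches a table that is the very next sibling of the h2. If the page places anything between the heading and the table, a looser option is to find the element by its id and then take the next table in document order with find_next. Here is a small sketch of that idea; the HTML is a made-up, trimmed-down stand-in for the structure described in the question, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two tables with the same class,
# each preceded by a heading that carries a distinct id.
html = """
<h2><span class="mw-headline" id="Other_section">Other section</span></h2>
<table class="toccolours sortable">
  <tr><th>Rank</th></tr>
  <tr><td>first table</td></tr>
</table>
<h2><span class="mw-headline" id="ZCTAs_ranked_by_per_capita_income">ZCTAs ranked by per capita income</span></h2>
<p>An intervening paragraph that would defeat the "+" selector.</p>
<table class="toccolours sortable">
  <tr><th>Rank</th><th>ZCTA</th></tr>
  <tr><td>1</td><td>19710</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its id, then walk forward to the next <table>.
heading = soup.find(id="ZCTAs_ranked_by_per_capita_income")
table = heading.find_next("table")

# Collect the cell text row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

The resulting table element can be passed to pd.read_html in the same way as tbl above.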

Upvotes: 2
