Reputation: 47
Relatively new, hoping for some direction!
For a project, I am looking to scrape the following table into a dataframe from this source: https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States
There are two tables on this page; I am interested in the second, "ZCTAs ranked by per capita income".
When reviewing the HTML for the page, I am unable to find anything that specifically identifies the table (or am unsure what to look for), so I am not sure which tag or attribute to pass to soup.find_all(). The opening tag of the table reads:
<table class="toccolours sortable jquery-tablesorter" align="center" cellpadding="4" cellspacing="3" style="border: 1px solid #707070;">
Both tables on the page are of the same table class. The header above the table I am trying to scrape lists a distinct id, "ZCTAs_ranked_by_per_capita_income". Directly above the table I'd like to scrape is the following code:
<h2><span class="mw-headline" id="ZCTAs_ranked_by_per_capita_income">ZCTAs ranked by per capita income</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States&action=edit&section=3" title="Edit section: ZCTAs ranked by per capita income">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
Any help would be much appreciated - if more info is needed, please let me know!
Upvotes: 1
Views: 82
Reputation: 195543
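You can use pandas' .read_html with the right index:
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States"
# read_html returns a list with one DataFrame per <table> it finds on the page
df = pd.read_html(url)[5]
print(df)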
Prints:
Rank Designation ZCTA Population Per CapitaIncome
0 1 Montchanin, Delaware[2] 19710 68 654485
1 2 Houston, Texas 77010 76 283189
2 3 Rockland, Delaware[3] 19732 77 279424
3 4 Miami Beach, Florida 33109 467 236238
4 5 Pineland, Florida 33945 79 162075
5 6 Esopus, New York 12429 51 155540
6 7 Henderson, Nevada 89012 175 148899
7 8 Atherton, California 94027 6857 114359
8 9 Boca Grande, Florida 33921 1500 107297
9 10 Deer Harbor, Washington[4] 98243 141 107173
10 11 Rancho Santa Fe, California 92067 7601 104487
11 12 Palm Beach, Florida 33480 11200 104294
12 13 Indianapolis, Indiana 46290 189 103347
13 14 Kenilworth, Illinois 60043 2617 99087
14 15 Beverly Hills, California 90210 21396 97198
15 16 Greenwich, Connecticut 6831 15167 97111
16 17 Los Angeles, California 90077 10465 96584
17 18 Portola Valley, California 94028 6595 96373
18 19 New York, New York 10022 30642 95196
19 20 Wyarno, Wyoming[5] 82845 49 94109
20 21 Short Hills, New Jersey 7078 12849 92940
21 22 Altamahaw, North Carolina[6] 27202 24 91666
22 23 Santa Monica, California 90402 11492 91147
23 24 New York, New York 10021 102078 91064
24 25 Gladwyne, Pennsylvania 19035 4050 90940
25 26 New York, New York 10069 1403 90113
26 27 Point Clear, Alabama 36564 107 89571
27 28 Boston, Massachusetts 2199 1005 88974
28 29 San Francisco, California 94105 2058 88829
29 30 Glencoe, Illinois 60022 8490 88126
30 31 Belvedere-Tiburon, California[7] 94920 13048 86992
31 32 Glencoe, Arkansas 72539 318 86724
32 33 Los Angeles, California 90067 2524 86319
33 34 Atlanta, Georgia 30327 21003 85883
34 35 New York, New York 10028 44987 85866
35 36 Houston, Texas 77046 471 85070
36 37 Lake McDonald, Montana[8] 59921 2 85000
37 38 New York, New York 10162 1726 84938
38 39 Mullett Lake, Michigan 49761 31 84692
39 40 Mc Afee, New Jersey[9] 7428 127 84595
40 41 New York, New York 10280 6614 83639
41 42 Yorklyn, Delaware[10] 19736 63 83524
42 43 Chicago, Illinois 60611 26522 82930
43 44 Boston, Massachusetts 2110 1428 82736
44 45 Boston, Massachusetts 2109 3428 82689
45 46 New York, New York 10282 1574 82348
46 47 Far Hills, New Jersey 7931 2766 82227
47 48 New Canaan, Connecticut 6840 19402 81934
48 49 Medina, Washington 98039 3050 81926
49 50 Pacific Palisades, California 90272 22538 81609
50 51 Los Altos, California 94022 18466 81257
51 52 San Francisco, California 94123 22903 81044
52 53 Longboat Key, Florida 34228 7603 80963
53 54 Davis, California 95618 643 80713
54 55 Alpine, New Jersey 7620 1649 80621
55 56 Atlanta, Georgia 30326 1075 80161
56 57 New York, New York 10023 62206 79736
57 58 Winnetka, Illinois 60093 19528 79651
58 59 Weston, Massachusetts 2493 11469 79640
59 60 Bacova, Virginia[11] 24412 89 79439
60 61 Springboro, Ohio 45066 17409 78786
61 62 Boston, Massachusetts 2108 3446 78771
62 63 Chappaqua, New York 10514 12004 78647
63 64 St. Louis, Missouri 63124 9819 78598
64 65 Ardsley-on-Hudson, New York[13] 10503 115 78591
65 66 New York, New York 10024 61414 77824
66 67 Essex Fells, New Jersey 7021 2151 77787
67 68 Rye, New York 10580 16737 77721
68 69 Glenbrook, Nevada[14] 89413 365 77639
69 70 Darien, Connecticut 6820 19607 77519
70 71 Captiva, Florida 33924 339 77458
71 72 Mill Neck, New York 11765 732 77420
72 73 Rex, North Carolina 28378 49 77306
73 74 Indian Wells, California 92210 3859 77302
74 75 Newport Coast, California 92657 5586 76870
75 76 Corona del Mar, California 92625 13407 76704
76 77 Wilmington, Delaware 19807 7345 76651
77 78 Dallas, Texas 75225 20314 76203
78 79 Chicago, Illinois 60601 5591 76157
79 80 Lake Forest, Illinois 60045 22248 75991
80 81 Los Angeles, California 90049 33520 75965
81 82 Vero Beach, Florida 32963 14077 75761
82 83 Bedford, New York 10506 5537 75723
83 84 San Francisco, California 94111 3335 75344
84 85 Weston, Connecticut 6883 10037 74817
85 86 Paradise Valley, Arizona 85253 17560 74605
86 87 Pound Ridge, New York 10576 4530 74127
87 88 Westport, Connecticut 6880 25807 74064
88 89 Washington, D.C. 20004 901 73803
89 90 Old Westbury, New York 11568 3992 72932
90 91 New York, New York 10128 59856 72691
91 92 Teterboro, New Jersey 7608 18 72613
92 93 Old Greenwich, Connecticut[15] 6870 7092 72317
93 94 Austin, Texas 78730 4885 72110
94 95 Bloomfield Hills, Michigan 48302 16409 71985
95 96 Norwalk, Connecticut 6853 3466 71642
96 97 Rumson, New Jersey 7760 9665 71585
97 98 Corolla, North Carolina 27927 648 71301
98 99 Gates Mills, Ohio 44040 2883 71016
99 100 Chicago, Illinois 60606 1682 70878
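A side note on the output above: ZCTAs such as 6831, 7078 and 2199 are printed with only four digits, because the column was parsed as integers and the leading zero was dropped. A minimal sketch of restoring the five-digit form, assuming the column is named ZCTA as in the printed header; the three rows are copied from the output above purely for illustration:

```python
import pandas as pd

# Three rows taken from the printed output; ZCTA has been parsed as integers,
# so the leading zeros of the Connecticut and Massachusetts codes are gone.
df = pd.DataFrame(
    {
        "Designation": [
            "Beverly Hills, California",
            "Greenwich, Connecticut",
            "Boston, Massachusetts",
        ],
        "ZCTA": [90210, 6831, 2199],
    }
)

# Convert to strings and left-pad with zeros to five characters.
df["ZCTA"] = df["ZCTA"].astype(str).str.zfill(5)
print(df)
```

It may also be possible to avoid the problem at parse time by passing converters={"ZCTA": str} to read_html, but I have not checked that against this particular page.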
If you want to be more specific, you can use bs4 and CSS selectors:
from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select the <table> that directly follows the <h2> containing the section id
tbl = soup.select_one("h2:has(#ZCTAs_ranked_by_per_capita_income) + table")
# newer pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(tbl)))[0]
print(df)
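To see what the selector is doing without hitting the network, here is a small offline sketch. The HTML snippet is a made-up, cut-down stand-in for the page (two tables with the same class, each directly after an h2), and the first heading's id "first_table" is hypothetical; only the second id comes from the question.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page: both tables share the same class,
# so only the preceding heading tells them apart.
html = """
<h2><span class="mw-headline" id="first_table">First table</span></h2>
<table class="toccolours sortable">
  <tr><th>Rank</th></tr>
  <tr><td>A</td></tr>
</table>
<h2><span class="mw-headline" id="ZCTAs_ranked_by_per_capita_income">ZCTAs ranked by per capita income</span></h2>
<table class="toccolours sortable">
  <tr><th>Rank</th><th>ZCTA</th></tr>
  <tr><td>1</td><td>19710</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# h2:has(#id) matches the <h2> that contains the element with that id;
# "+ table" then takes the <table> that is its next sibling element.
table = soup.select_one("h2:has(#ZCTAs_ranked_by_per_capita_income) + table")

# Collect the text of every header and data cell, row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Note that the selector only matches when the table is the very next element after the h2. If the live page wraps the heading in another element or puts a paragraph in between, the selector would need adjusting.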
Upvotes: 2