Reputation: 117
I have this frame:
regions = pd.read_html('http://www.mapsofworld.com/usa/usa-maps/united-states-regional-maps.html')
messy_regions = regions[8]
Which yields something like this:
|0 | 1
--- |---| ---
0| Region 1 (The Northeast)| nan
1| Division 1 (New England)| Division 2 (Middle Atlantic)
2| Maine | New York
3| New Hampshire | Pennsylvania
4| Vermont | New Jersey
5| Massachusetts |nan
6| Rhode Island |nan
7| Connecticut | nan
8| Region 2 (The Midwest) | nan
9| Division 3 (East North Central)| Division 4 (West North Central)
10| Wisconsin | North Dakota
11| Michigan | South Dakota
12| Illinois | Nebraska
The goal is to make this a tidy dataframe and I think I need to pivot in order to get the regions and Divisions as columns with the states as values under the correct regions/divisions. Once it's in that shape then I can just melt to the desired shape. I can't figure out though how to extract what would be the column headers out of this though. Any help is appreciated and at the very least a good point in the right direction.
Upvotes: 1
Views: 82
Reputation: 862591
You can use:
url = 'http://www.mapsofworld.com/usa/usa-maps/united-states-regional-maps.html'
#input dataframe with columns a, b
df = pd.read_html(url)[8]
df.columns = ['a','b']
#extract Region data to new column
df['Region'] = df['a'].where(df['a'].str.contains('Region', na=False)).ffill()
#reshaping, remove rows with NaNs, remove column variable
df = pd.melt(df, id_vars='Region', value_name='Names')
.sort_values(['Region', 'variable'])
.dropna()
.drop('variable', axis=1)
#extract Division data to new column
df['Division'] = df['Names'].where(df['Names'].str.contains('Division', na=False)).ffill()
#remove duplicates from column Names, change order of columns
df = df[(df.Division != df.Names) & (df.Region != df.Names)]
.reset_index(drop=False)
.reindex_axis(['Region','Division','Names'], axis=1)
#temporaly display all columns
with pd.option_context('display.expand_frame_repr', False):
print (df)
Region Division Names
0 Region 1 (The Northeast) Division 1 (New England) Maine
1 Region 1 (The Northeast) Division 1 (New England) New Hampshire
2 Region 1 (The Northeast) Division 1 (New England) Vermont
3 Region 1 (The Northeast) Division 1 (New England) Massachusetts
4 Region 1 (The Northeast) Division 1 (New England) Rhode Island
5 Region 1 (The Northeast) Division 1 (New England) Connecticut
6 Region 1 (The Northeast) Division 2 (Middle Atlantic) New York
7 Region 1 (The Northeast) Division 2 (Middle Atlantic) Pennsylvania
8 Region 1 (The Northeast) Division 2 (Middle Atlantic) New Jersey
9 Region 2 (The Midwest) Division 3 (East North Central) Wisconsin
10 Region 2 (The Midwest) Division 3 (East North Central) Michigan
11 Region 2 (The Midwest) Division 3 (East North Central) Illinois
12 Region 2 (The Midwest) Division 3 (East North Central) Indiana
13 Region 2 (The Midwest) Division 3 (East North Central) Ohio
...
...
Upvotes: 1