Reputation: 1105
I've a massive geo json in this form:
{'features': [{'properties': {'MARKET': 'Albany',
'geometry': {'coordinates': [[[-74.264948, 42.419877, 0],
[-74.262041, 42.425856, 0],
[-74.261175, 42.427631, 0],
[-74.260384, 42.429253, 0]]],
'type': 'Polygon'}}},
{'properties': {'MARKET': 'Albany',
'geometry': {'coordinates': [[[-73.929627, 42.078788, 0],
[-73.929114, 42.081658, 0]]],
'type': 'Polygon'}}},
{'properties': {'MARKET': 'Albuquerque',
'geometry': {'coordinates': [[[-74.769198, 43.114089, 0],
[-74.76786, 43.114496, 0],
[-74.766474, 43.114656, 0]]],
'type': 'Polygon'}}}],
'type': 'FeatureCollection'}
After reading the json:
import json
with open('x.json') as f:
data = json.load(f)
I read the values into a list and then into a dataframe:
#to get a list of all markets
mkt=set([f['properties']['MARKET'] for f in data['features']])
#to create a list of market and associated lat long
markets=[(market,list(chain.from_iterable(f['geometry']['coordinates']))) for f in data['features'] for market in mkt if f['properties']['MARKET']==mkt]
df = pd.DataFrame(markets[0:], columns=['a','b'])
First few rows of df are:
a b
0 Albany [[-74.264948, 42.419877, 0], [-74.262041, 42.4...
1 Albany [[-73.929627, 42.078788, 0], [-73.929114, 42.0...
2 Albany [[-74.769198, 43.114089, 0], [-74.76786, 43.11...
Then to unnest the nested list in column b, I used pandas concat
:
df1 = pd.concat([df.iloc[:,0:1], df['b'].apply(pd.Series)], axis=1)
But this is creating 8070 columns with many NaNs. Is there a way to group all the latitudes and longitudes by the Market (column a)? A million rows by two column dataframe is desired.
Desired op is:
mkt lat long
Albany 42.419877 -74.264948
Albany 42.078788 -73.929627
..
Albuquerque 35.105361 -106.640342
Pls note that the zero in the list element ([-74.769198, 43.114089, 0]) needs to be ignored.
Upvotes: 9
Views: 21227
Reputation: 5974
The answers above are excellent, but here's something a little different. The Awkward Array library (note: I'm the author) is meant for working with nested data structures like this at large scale. As a coincidence, I used a GeoJSON file as a motivating example in the documentation, though I'm working on a few more tutorials that take larger Parquet files as example data, unrelated to geography.
(That's how this is different from @kamal-barshevich's geopandas answer: geopandas is a domain-specific library that "knows about" geography and will likely have functionality relevant to domain experts in that field. Awkward Array is a generic library for manipulating data structures that doesn't know anything about geography.)
The documentation I've linked above has some examples of manipulating a GeoJSON file with the array functions themselves, without Pandas, starting from here:
>>> import urllib.request
>>> import awkward as ak
>>>
>>> url = "https://raw.githubusercontent.com/Chicago/osd-bike-routes/master/data/Bikeroutes.geojson"
>>> bikeroutes_json = urllib.request.urlopen(url).read()
>>> bikeroutes = ak.from_json(bikeroutes_json)
>>> bikeroutes
<Record ... [-87.7, 42], [-87.7, 42]]]}}]} type='{"type": string, "crs": {"type"...'>
But in this answer, I'll make the Pandas structure that you want. The ak.to_pandas function turns nested lists into a MultiIndex. Applying it to just the "coordinates"
inside "geometry"
inside "features"
:
>>> bikeroutes.features.geometry.coordinates
<Array [[[[-87.8, 41.9], ... [-87.7, 42]]]] type='1061 * var * var * var * float64'>
>>>
>>> ak.to_pandas(bikeroutes.features.geometry.coordinates)
values
entry subentry subsubentry subsubsubentry
0 0 0 0 -87.788573
1 41.923652
1 0 -87.788646
1 41.923651
2 0 -87.788845
... ...
1060 0 8 1 41.950493
9 0 -87.714819
1 41.950724
10 0 -87.715284
1 41.951042
[96724 rows x 1 columns]
The list nesting is three levels deep, the last of which are longitude, latitude pairs (e.g. [-87.788573, 41.923652]
). You want these in separate columns:
>>> bikeroutes.features.geometry.coordinates[..., 0]
<Array [[[-87.8, -87.8, ... -87.7, -87.7]]] type='1061 * var * var * float64'>
>>> bikeroutes.features.geometry.coordinates[..., 1]
<Array [[[41.9, 41.9, 41.9, ... 42, 42, 42]]] type='1061 * var * var * float64'>
This is using a slice like NumPy's (Awkward Array is a generalization of NumPy), taking everything in the all dimensions except the last (...
); the first expression extracts items 0
(longitude) and the second extracts items 1
(latitude).
We can combine these in a new record type using ak.zip to give them column names:
>>> ak.to_pandas(ak.zip({
... "longitude": bikeroutes.features.geometry.coordinates[..., 0],
... "latitude": bikeroutes.features.geometry.coordinates[..., 1],
... }))
longitude latitude
entry subentry subsubentry
0 0 0 -87.788573 41.923652
1 -87.788646 41.923651
2 -87.788845 41.923650
3 -87.788951 41.923649
4 -87.789092 41.923648
... ... ...
1060 0 6 -87.714026 41.950199
7 -87.714335 41.950388
8 -87.714486 41.950493
9 -87.714819 41.950724
10 -87.715284 41.951042
[48362 rows x 2 columns]
and this is pretty close to what you're looking for. The one last thing you wanted was to match each of these with one of the "properties"
from "features"
. My GeoJSON file doesn't have "MARKET"
:
>>> bikeroutes.features.properties.type
1061 * {"STREET": string, "TYPE": string, "BIKEROUTE": string, "F_STREET": string, "T_STREET": option[string]}
but "STREET"
may be a good stand-in. These properties are at a different level of nesting than the coordinates:
>>> bikeroutes.features.geometry.coordinates[..., 0].type
1061 * var * var * float64
>>> bikeroutes.features.properties.STREET.type
1061 * string
The longitude points are two levels of nested lists deeper than the street names, but ak.zip broadcasts them down (similar to NumPy's concept of broadcasting, with necessary extensions for variable-length lists).
The final expression is:
>>> ak.to_pandas(ak.zip({
... "longitude": bikeroutes.features.geometry.coordinates[..., 0],
... "latitude": bikeroutes.features.geometry.coordinates[..., 1],
... "street": bikeroutes.features.properties.STREET,
... }))
longitude latitude street
entry subentry subsubentry
0 0 0 -87.788573 41.923652 W FULLERTON AVE
1 -87.788646 41.923651 W FULLERTON AVE
2 -87.788845 41.923650 W FULLERTON AVE
3 -87.788951 41.923649 W FULLERTON AVE
4 -87.789092 41.923648 W FULLERTON AVE
... ... ... ...
1060 0 6 -87.714026 41.950199 N ELSTON AVE
7 -87.714335 41.950388 N ELSTON AVE
8 -87.714486 41.950493 N ELSTON AVE
9 -87.714819 41.950724 N ELSTON AVE
10 -87.715284 41.951042 N ELSTON AVE
[48362 rows x 3 columns]
Since you just want to relate market with longitude, latitude points, you can ignore the MultiIndex, or you can use Pandas functions to turn the components of that MultiIndex into columns.
Upvotes: 3
Reputation: 81
@Anton_vBR gave a great answer!
However, consider also "geopandas" library as an alternative:
import geopandas
df = geopandas.read_file("yourfile.geojson")
where df will be "class geopandas.GeoDataFrame", which will allow you to manipulate geojson as it is like ordinary pandas's DataFrame (recursively through structures inside)
Upvotes: 5
Reputation: 18916
Something like this??
from pandas.io.json import json_normalize
df = json_normalize(geojson["features"])
coords = 'properties.geometry.coordinates'
df2 = (df[coords].apply(lambda r: [(i[0],i[1]) for i in r[0]])
.apply(pd.Series).stack()
.reset_index(level=1).rename(columns={0:coords,"level_1":"point"})
.join(df.drop(coords,1), how='left')).reset_index(level=0)
df2[['lat','long']] = df2[coords].apply(pd.Series)
df2
Outputs:
index point properties.geometry.coordinates properties.MARKET \
0 0 0 (-74.264948, 42.419877) Albany
1 0 1 (-74.262041, 42.425856) Albany
2 0 2 (-74.261175, 42.427631) Albany
3 0 3 (-74.260384, 42.429253) Albany
4 1 0 (-73.929627, 42.078788) Albany
5 1 1 (-73.929114, 42.081658) Albany
6 2 0 (-74.769198, 43.114089) Albuquerque
7 2 1 (-74.76786, 43.114496) Albuquerque
8 2 2 (-74.766474, 43.114656) Albuquerque
properties.geometry.type lat long
0 Polygon -74.264948 42.419877
1 Polygon -74.262041 42.425856
2 Polygon -74.261175 42.427631
3 Polygon -74.260384 42.429253
4 Polygon -73.929627 42.078788
5 Polygon -73.929114 42.081658
6 Polygon -74.769198 43.114089
7 Polygon -74.767860 43.114496
8 Polygon -74.766474 43.114656
If:
geojson = {'features': [{'properties': {'MARKET': 'Albany',
'geometry': {'coordinates': [[[-74.264948, 42.419877, 0],
[-74.262041, 42.425856, 0],
[-74.261175, 42.427631, 0],
[-74.260384, 42.429253, 0]]],
'type': 'Polygon'}}},
{'properties': {'MARKET': 'Albany',
'geometry': {'coordinates': [[[-73.929627, 42.078788, 0],
[-73.929114, 42.081658, 0]]],
'type': 'Polygon'}}},
{'properties': {'MARKET': 'Albuquerque',
'geometry': {'coordinates': [[[-74.769198, 43.114089, 0],
[-74.76786, 43.114496, 0],
[-74.766474, 43.114656, 0]]],
'type': 'Polygon'}}}],
'type': 'FeatureCollection'}
Upvotes: 10