Reputation: 691
I have been trying to repeat the Zeppelin notebook Magellan example but using assets with geo-location information (assets DF) and trying to map them to zip-codes (zipcode DF). I got the zip-code shapefile from USGS and located it into Spark.
Here is what the asset DF looks like. It consists of an asset ID and a point on a map.
+---------+--------------------+
| asset| point|
+---------+--------------------+
| 10|Point(-96.7595319...|
| 11|Point(4.7115951, ...|
The zipcode DF is a built up from the USGS shapefile for the US zipcodes. This is were i use Magellan
val zipcodes = magellanContext.read.format("magellan").
load("magellan_us_states").
select($"polygon", $"metadata").
cache()
The results are the zipcode DF are
+--------------------+--------------------+
| polygon| metadata|
+--------------------+--------------------+
|Polygon(5, Wrappe...|Map(YEAR_ADM -> ...|
|Polygon(5, Wrappe...|Map(YEAR_ADM -> ...|
I then join the two DF together and do a query
val joined = zipcodes.
join(assets).
where($"point" within $"polygon").
select($"asset", explode($"metadata").as(Seq("k", "v"))).
withColumnRenamed("v", "state").
drop("k").
cache()
the results are as follows:
+--------+--------------------+
| asset#| state|
+--------+--------------------+
|10 |Arizona |
|10 | 48|
|10 | 1903|
|10 | 04|
|10 | 23.753|
|10 | February |
|10 | 1912|
|10 | 28.931|
|10 | 14|
|11 |North Carolina |
...
The problem is that when i explode the metadata i only want the states. How do i explode that data so that i only end up with table that looks like this -
+--------+--------------------+
| asset#| state|
+--------+--------------------+
|10 |Arizona |
|11 |North Carolina |
|12 |Arizona |
...
Upvotes: 1
Views: 466
Reputation: 330343
How do i explode that data so that i only end up with table that looks like this
Simply don't use explode
. Instead you simply select
the field of interest:
df.select($"asset", $"metadata".getItem("state").alias("state"))
Upvotes: 2