Reputation: 2520
I am reading multiple complex json files and trying to extract geo coordinates. I cannot attach the file itself right now, but I can print the tree here. The file has several hundred options and some objects repeat.
Please see the structure of the file in .txt
format.
When I read the json with Spark in Python, it shows me these coordinates in coordinates
column and it is there.
It is stored in coordinates
column. Please see a proof.
I am obviously trying to reduce the number of columns and select some columns.
The last two columns are my geo coordinates. I tried both coordinates
and geo
and also coordinates.coordinates
with geo.coordinates
. Both options do not work.
df_tweets = tweets.select(['text',
'user.name',
'user.screen_name',
'user.id',
'user.location',
'place.country',
'place.full_name',
'place.name',
'user.followers_count',
'retweet_count',
'retweeted',
'user.friends_count',
'entities.hashtags.text',
'created_at',
'timestamp_ms',
'lang',
'coordinates.coordinates', # or just `coordinates`
'geo.coordinates' # or just `geo`
])
In the first case with coordinates
and geo
I get the following, printing the schema:
df_tweets.printSchema()
root
|-- text: string (nullable = true)
|-- name: string (nullable = true)
|-- screen_name: string (nullable = true)
|-- id: long (nullable = true)
|-- location: string (nullable = true)
|-- country: string (nullable = true)
|-- full_name: string (nullable = true)
|-- name: string (nullable = true)
|-- followers_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- friends_count: long (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- created_at: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- lang: string (nullable = true)
|-- coordinates: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
|-- geo: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
When I do coordinates.coordinates
and geo.coordinates
, I get
root
|-- text: string (nullable = true)
|-- name: string (nullable = true)
|-- screen_name: string (nullable = true)
|-- id: long (nullable = true)
|-- location: string (nullable = true)
|-- country: string (nullable = true)
|-- full_name: string (nullable = true)
|-- name: string (nullable = true)
|-- followers_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- friends_count: long (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- created_at: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- lang: string (nullable = true)
|-- coordinates: array (nullable = true)
| |-- element: double (containsNull = true)
|-- coordinates: array (nullable = true)
| |-- element: double (containsNull = true)
When I print both dataframes in Pandas, none of them gives me coordinates, I still have None
.
How to extract geo coordinates properly?
Upvotes: 0
Views: 136
Reputation: 5648
If I look at my dataframe with tweet data, I see it like this
In [44]: df[df.coordinates.notnull()]['coordinates']
Out[44]:
98 {'type': 'Point', 'coordinates': [-122.32111, ...
99 {'type': 'Point', 'coordinates': [-122.32111, ...
Name: coordinates, dtype: object
So it's a dictionary that has to be parsed
tweets_coords = df[df.coordinates.notnull()]['coordinates'].tolist()
for coords in tweets_coords:
print(coords)
print(coords['coordinates'])
print(coords['coordinates'][0])
print(coords['coordinates'][1])
Output:
{'type': 'Point', 'coordinates': [-122.32111, 47.62366]}
[-122.32111, 47.62366]
-122.32111
47.62362
{'type': 'Point', 'coordinates': [-122.32111, 47.62362]}
[-122.32111, 47.62362]
-122.32111
47.62362
You can setup a lambda function in apply()
to parse these out row by row, otherwise you can use list comprehension using what i've provided as the basis for your analysis.
All that said, maybe check this first...
Where you are using coordinates.coordinates
and geo.coordinates
, try coordinates['coordinates']
and geo['coordinates']
Upvotes: 1