Reputation: 305
I'm working with a third party dataset that includes location data. I'm trying to extract the Longitude and Latitude coordinates from the location column. As stated in their doc:
The
location
column is of thepoint
datatype.
When I ingest the data and view the location column it appears as below
>>> results_df["location"].iloc[1]
{'coordinates': array([-97.707172829666, 30.385328900508]), 'type': 'Point'}
Not all rows have a value, some are None.
I know I can do this to get the values for a specific row:
>>> results_df["location"].iloc[1]['coordinates'][0]
-97.707172829666
>>> results_df["location"].iloc[1]['coordinates'][1]
30.385328900508
But I'd like to create two new columns, longitude and latitude, for the entire dataframe. For each row, how can I extract the coordinates to populate the new column?
Example:
latitude | longitude |
---|---|
30.385328900508 | -97.707172829666 |
39.395338984595 | -74.587966507573 |
None | None |
42.396358150943 | -104.664962196304 |
None | None |
EDIT: I am working through the Pyspark Pandas API.
Upvotes: 0
Views: 105
Reputation: 3010
It should be similar to the above, however I used pyspark.pandas as you mentioned you are using pyspark pandas.
EDIT
import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)
# Sample data (this matches your original structure)
sample_data = {
'location': [
{'coordinates': [-97.7071728296666, 30.385328900508], 'type': 'Point'},
None,
{'coordinates': [-74.587966507573, 39.395338984595], 'type': 'Point'},
None,
{'coordinates': [-104.66496219630, 42.396358150943], 'type': 'Point'},
]
}
# Create a PySpark Pandas DataFrame
results_df = ps.DataFrame(sample_data)
# Now add these Series to the DataFrame
results_df[['longitude','latitude']]= results_df.apply(lambda row: [row.location['coordinates'][0],row.location['coordinates'][1]] if row.location is not None else [None,None],axis=1, result_type="expand")
# Show the updated DataFrame
print(results_df)
Upvotes: 0
Reputation: 3002
you can use lambda and apply for this:
import pandas as pd
def extract_coordinates(df):
get_longitude = lambda x: x['coordinates'][0] if isinstance(x, dict) and 'coordinates' in x else None
get_latitude = lambda x: x['coordinates'][1] if isinstance(x, dict) and 'coordinates' in x else None
df['longitude'] = df['location'].apply(get_longitude)
df['latitude'] = df['location'].apply(get_latitude)
return df
sample_data = {
'location': [
{'coordinates': [-97.7071728296666, 30.385328900508], 'type': 'Point'},
{'coordinates': [-74.587966507573, 39.395338984595], 'type': 'Point'},
None,
{'coordinates': [-104.66496219630, 42.396358150943], 'type': 'Point'},
None
]
}
results_df = pd.DataFrame(sample_data)
print("\nSample row from the dataset:")
print(results_df['location'].iloc[0])
results_df = extract_coordinates(results_df)
print("\nDataFrame with extracted coordinates:")
print(results_df)
$ python extract-ele-from-data.py
Sample row from the dataset:
{'coordinates': [-97.7071728296666, 30.385328900508], 'type': 'Point'}
DataFrame with extracted coordinates:
location longitude latitude
0 {'coordinates': [-97.7071728296666, 30.3853289... -97.707173 30.385329
1 {'coordinates': [-74.587966507573, 39.39533898... -74.587967 39.395339
2 None NaN NaN
3 {'coordinates': [-104.6649621963, 42.396358150... -104.664962 42.396358
4 None NaN NaN
Upvotes: 1