MyNameHere
MyNameHere

Reputation: 305

For reach row in dataframe, how to extract elements from an array?

I'm working with a third party dataset that includes location data. I'm trying to extract the Longitude and Latitude coordinates from the location column. As stated in their doc:

The location column is of the point datatype.

When I ingest the data and view the location column it appears as below

>>> results_df["location"].iloc[1]
{'coordinates': array([-97.707172829666,  30.385328900508]), 'type': 'Point'}

Not all rows have a value, some are None.

I know I can do this to get the values for a specific row:

>>> results_df["location"].iloc[1]['coordinates'][0]
-97.707172829666
>>> results_df["location"].iloc[1]['coordinates'][1]
30.385328900508

But I'd like to create two new columns, longitude and latitude, for the entire dataframe. For each row, how can I extract the coordinates to populate the new column?

Example:

latitude longitude
30.385328900508 -97.707172829666
39.395338984595 -74.587966507573
None None
42.396358150943 -104.664962196304
None None

EDIT: I am working through the Pyspark Pandas API.

Upvotes: 0

Views: 105

Answers (2)

samhita
samhita

Reputation: 3010

It should be similar to the above, however I used pyspark.pandas as you mentioned you are using pyspark pandas.

EDIT

import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)

# Sample data (this matches your original structure)
sample_data = {
    'location': [
        {'coordinates': [-97.7071728296666, 30.385328900508], 'type': 'Point'},
        None,
        {'coordinates': [-74.587966507573, 39.395338984595], 'type': 'Point'},
        None,
        {'coordinates': [-104.66496219630, 42.396358150943], 'type': 'Point'},
    ]
}

# Create a PySpark Pandas DataFrame
results_df = ps.DataFrame(sample_data)

# Now add these Series to the DataFrame
results_df[['longitude','latitude']]= results_df.apply(lambda row: [row.location['coordinates'][0],row.location['coordinates'][1]] if row.location is not None else [None,None],axis=1, result_type="expand")


# Show the updated DataFrame
print(results_df)

enter image description here

Upvotes: 0

willwrighteng
willwrighteng

Reputation: 3002

you can use lambda and apply for this:

import pandas as pd

def extract_coordinates(df):
    get_longitude = lambda x: x['coordinates'][0] if isinstance(x, dict) and 'coordinates' in x else None
    get_latitude = lambda x: x['coordinates'][1] if isinstance(x, dict) and 'coordinates' in x else None

    df['longitude'] = df['location'].apply(get_longitude)
    df['latitude'] = df['location'].apply(get_latitude)

    return df

sample_data = {
    'location': [
        {'coordinates': [-97.7071728296666, 30.385328900508], 'type': 'Point'},
        {'coordinates': [-74.587966507573, 39.395338984595], 'type': 'Point'},
        None,
        {'coordinates': [-104.66496219630, 42.396358150943], 'type': 'Point'},
        None
    ]
}

results_df = pd.DataFrame(sample_data)
print("\nSample row from the dataset:")
print(results_df['location'].iloc[0])

results_df = extract_coordinates(results_df)
print("\nDataFrame with extracted coordinates:")
print(results_df)

results

$ python extract-ele-from-data.py

Sample row from the dataset:
{'coordinates': [-97.7071728296666, 30.385328900508], 'type': 'Point'}

DataFrame with extracted coordinates:
                                            location   longitude   latitude
0  {'coordinates': [-97.7071728296666, 30.3853289...  -97.707173  30.385329
1  {'coordinates': [-74.587966507573, 39.39533898...  -74.587967  39.395339
2                                               None         NaN        NaN
3  {'coordinates': [-104.6649621963, 42.396358150... -104.664962  42.396358
4                                               None         NaN        NaN

Upvotes: 1

Related Questions