Reut
Reut

Reputation: 1592

Extract from column only the digits and split to different columns

I have very big dataframe that a row in it typically looks like that:

>>>ID    name    year    location
0  341   Dali    1995   {{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}
...

df['geolocation'] = df['geolocation'].str.replace(r'\D', '') I would like to break the location column into many columns that contain only the numebrs and to get rid of the "latitude", "longtitude" and all the symbols in between.

I thouhgt to do it first by extract only the digits like this:

df['location'] = df['location'].str.extract('(\d+)', expand=False)

but that for some reason gave me the location column as ine integer number.

I didn't want to use split because the symbols in the middle are not consistent, somtimes you have this sequence : {{"{" and sometimes it can be only "{"{" and I can't really trace all the possibilites that might be there. also the number of digits is different in the different rows.

My desired result should look like this:

>>>ID    name    year    lat                 long                     lat1          long1 ....
0  341   Dali    1995    9.4714611480000004  4.3520187860000004 9.4720611479999999 4.3520187860000004

edit: I have also tried this:

df['location'] = df['location'].str.replace(r'\D', '')

which preserved the numbers but gave me one very ling number ,without preservring the "." and also without haveing any space between the numbers

Upvotes: 0

Views: 80

Answers (1)

Lawhatre
Lawhatre

Reputation: 1450

I have used regex match for efficient extraction of latitudes and longitudes. This can obtained using following code.

import re
import pandas as pd

df = pd.DataFrame({
    'ID': [341,321],
    'name':['Dali','daLi'],
    'year':[1995, 1996],
    'location':['{{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}',
                '{{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}']
})

Solution

df_new = df.location.apply(lambda x: re.findall(r"\d+\.*\d*",x))
df_new = pd.DataFrame(df_new.to_list(), columns=['lat1','long1','lat2','long2'])
pd.concat([df.iloc[:,0:3], df_new], axis=1)

output

    ID  name    year    lat1                long1               lat2                long2
0   341 Dali    1995    9.4714611480000004  4.3520187860000004  9.4720611479999999  4.3520187860000004
1   321 daLi    1996    9.4714611480000004  4.3520187860000004  9.4720611479999999  4.3520187860000004

Upvotes: 2

Related Questions