Reputation: 1592
I have very big dataframe that a row in it typically looks like that:
>>>ID name year location
0 341 Dali 1995 {{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}
...
df['geolocation'] = df['geolocation'].str.replace(r'\D', '') I would like to break the location column into many columns that contain only the numebrs and to get rid of the "latitude", "longtitude" and all the symbols in between.
I thouhgt to do it first by extract only the digits like this:
df['location'] = df['location'].str.extract('(\d+)', expand=False)
but that for some reason gave me the location column as ine integer number.
I didn't want to use split because the symbols in the middle are not consistent, somtimes you have this sequence : {{"{" and sometimes it can be only "{"{" and I can't really trace all the possibilites that might be there. also the number of digits is different in the different rows.
My desired result should look like this:
>>>ID name year lat long lat1 long1 ....
0 341 Dali 1995 9.4714611480000004 4.3520187860000004 9.4720611479999999 4.3520187860000004
edit: I have also tried this:
df['location'] = df['location'].str.replace(r'\D', '')
which preserved the numbers but gave me one very ling number ,without preservring the "." and also without haveing any space between the numbers
Upvotes: 0
Views: 80
Reputation: 1450
I have used regex match for efficient extraction of latitudes and longitudes. This can obtained using following code.
import re
import pandas as pd
df = pd.DataFrame({
'ID': [341,321],
'name':['Dali','daLi'],
'year':[1995, 1996],
'location':['{{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}',
'{{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}']
})
Solution
df_new = df.location.apply(lambda x: re.findall(r"\d+\.*\d*",x))
df_new = pd.DataFrame(df_new.to_list(), columns=['lat1','long1','lat2','long2'])
pd.concat([df.iloc[:,0:3], df_new], axis=1)
output
ID name year lat1 long1 lat2 long2
0 341 Dali 1995 9.4714611480000004 4.3520187860000004 9.4720611479999999 4.3520187860000004
1 321 daLi 1996 9.4714611480000004 4.3520187860000004 9.4720611479999999 4.3520187860000004
Upvotes: 2