Reputation: 497
I have a dataframe that I harvested from Wikipedia with lat/long coordinates as a column, and I'm trying to remove the occurrences of strings in parentheses that appear in some of the rows but not all.
Sample:
25 53.74333, 91.38583
47 -10.167, 148.700 (Abau Airport)
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667 (Abdelhafid Boussouf Bou Ch...)
I've tried doing this:
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: float(x.replace('[^\d.]', '')))
This throws the following error, which basically indicates that not all rows have characters to remove. That's fine, but if I implement a for loop with a try/except, I'll have to map the results back, and this dataframe doesn't have a unique ID to use as a key.
ValueError: could not convert string to float: '53.58472, 14.90222'
Removing the float cast and doing this:
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: x.replace('[^\d.]', ''))
The code executes, but no changes are made, and I'm not sure why.
The expected output should look like this:
25 53.74333, 91.38583
47 -10.167, 148.700
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667
Upvotes: 1
Views: 594
Reputation: 150735
This is just a simple regex:
df.Lat_Lon.str.extract(r'^([-\d\.,\s]+)')
Output:
0
25 53.74333, 91.38583
47 -10.167, 148.700
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667
You can go further and extract the latitude and longitude into separate columns:
df.Lat_Lon.str.extract(r'^(?P<Lat>[-\d\.]+),\s(?P<Lon>[-\d\.]+)')
Output:
Lat Lon
25 53.74333 91.38583
47 -10.167 148.700
155 16.63611 -14.19028
414 49.02528 -122.36000
1 16.01111 43.17778
176 35.34167 1.46667
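If you also want numeric values, as the float cast in the question suggests, you could assign the extracted columns back and convert them. A minimal sketch, assuming the frame is called df; any row that doesn't match the pattern becomes NaN:
df[['Lat', 'Lon']] = (
    df.Lat_Lon
    .str.extract(r'^(?P<Lat>[-\d.]+),\s(?P<Lon>[-\d.]+)')
    .astype(float)  # non-matching rows end up as NaN
)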
Upvotes: 1
Reputation: 728
Instead of using Python's str.replace, use pandas' Series.replace with the regex=True option. Your line should instead be:
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].replace(r'[^\d.]', '', regex=True)
Just a heads up, I assumed your regex string was well formed.
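One caveat: [^\d.] also strips the commas, spaces and minus signs, so the two coordinates run together and negative values lose their sign. If the goal is only to drop the trailing parenthesized names, a narrower pattern may be closer to the intent. A sketch, assuming Lat_Lon holds strings like the sample:
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].replace(r'\s*\(.*\)$', '', regex=True)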
Upvotes: 1