Sebastian Goslin
Sebastian Goslin

Reputation: 497

Removing unwanted strings from numeric float pandas string column

I have a dataframe that I harvested from wikipedia that has lat long coords as a col, and I'm trying to remove occurences of strings between parens that occur in some of the rows but not all.

Sample:

25     53.74333, 91.38583
47    -10.167, 148.700 (Abau Airport)
155    16.63611, -14.19028
414    49.02528, -122.36000
1      16.01111, 43.17778
176    35.34167, 1.46667 (Abdelhafid Boussouf Bou Ch...)

I've tried doing this

big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: float(x.replace('[^\d.]', '')))

Which throws this error, basically indicating that not all have characters to remove, which is fine but if I try implementing a for loop to use a try/catch then I'll have to map and in the case of this dataframe I don't have a unique ID to use as a key.

ValueError: could not convert string to float: '53.58472, 14.90222' 

Removing the float cast and doing this:

big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: x.replace('[^\d.]', ''))

The code executes, but no changes are made to which I'm not sure why.

The expected output should look like this:

25     53.74333, 91.38583
47    -10.167, 148.700
155    16.63611, -14.19028
414    49.02528, -122.36000
1      16.01111, 43.17778
176    35.34167, 1.46667

Upvotes: 1

Views: 594

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150735

This is just a simple regex:

 df.Lat_Lon.str.extract('^([-\d\.,\s]+)')

Output:

                        0
25     53.74333, 91.38583
47       -10.167, 148.700
155   16.63611, -14.19028
414  49.02528, -122.36000
1      16.01111, 43.17778
176     35.34167, 1.46667

You can go as far as extracting both the latitude and longitude:

df.Lat_Lon.str.extract('^(?P<Lat>[-\d\.]+),\s(?P<Lon>[-\d\.]+)')

Output:

          Lat         Lon
25   53.74333    91.38583
47    -10.167     148.700
155  16.63611   -14.19028
414  49.02528  -122.36000
1    16.01111    43.17778
176  35.34167     1.46667

Upvotes: 1

R. Arctor
R. Arctor

Reputation: 728

Instead of using python's str.replace use pandas DataFrame.replace with regex=True option. So instead your line should be:

big_with_l['Lat_Lon'] = big_with_ll['Lat_Lon'].replace(r'[^\d.]', '', regex=True)

Just a heads up, I assumed your regex string was well formed.

Upvotes: 1

Related Questions