Reputation: 20119
I have a dataframe in Pandas like this:
id loc
40 100005090 -38.229889,-72.326819
188 100020985 ut: -33.442101,-70.650327
249 10002732 ut: -33.437478,-70.614637
361 100039605 ut: 10.646041,-71.619039 \N
440 100048229 4.666439,-74.071554
I need to extract the GPS points. I first use str.contains with a certain regex (found here on SO, see below) to match all cells that have a "valid" lat/long value. However, I also need to extract
these numbers and either put them in a series of their own (and then call split on the comma) or put them in two new pandas series. I have tried the following for the extraction part:
ids_with_latlong["loc"].str.extract(r"[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$")
but, judging by the output, it looks like the regex is not matching greedily, because I get something like this:
0 1 2 3 4 5 6 7 8
40 38.229889 .229889 NaN 72.326819 NaN 72 NaN 72 .326819
188 33.442101 .442101 NaN 70.650327 NaN 70 NaN 70 .650327
Obviously it's matching more than I want (I would just need cols 0, 1, and 4), but simply dropping the extra columns is too much of a hack for me. Notice that the extract function also got rid of the +/- signs at the beginning. If anyone has a solution, I'd really appreciate it.
Upvotes: 1
Views: 714
Reputation: 30424
@HYRY's answer looks pretty good to me. This is just an alternative approach that uses built-in pandas string methods rather than a regex. I think it's a little simpler to read, though I'm not sure it will be sufficiently general for all your cases (it works fine on this sample data, at least).
df['loc'] = df['loc'].str.replace('ut: ', '', regex=False)
df['loc'] = df['loc'].str.replace(' \\N', '', regex=False)  # also strip the trailing \N seen in row 361
df['lat'] = df['loc'].apply(lambda x: x.split(',')[0])
df['lon'] = df['loc'].apply(lambda x: x.split(',')[1])
id loc lat lon
0 100005090 -38.229889,-72.326819 -38.229889 -72.326819
1 100020985 -33.442101,-70.650327 -33.442101 -70.650327
2 10002732 -33.437478,-70.614637 -33.437478 -70.614637
3 100039605 10.646041,-71.619039 10.646041 -71.619039
4 100048229 4.666439,-74.071554 4.666439 -74.071554
As a general suggestion for this type of approach, you might think about doing it in the following steps:
1) remove extraneous characters with replace
(or maybe this is where the regex is best)
2) split into pieces
3) check that each piece is valid (all you really need is to check that it's a number, though you could go a step further and verify it falls within the valid range for a lat or lon)
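The three steps above might look something like this (a sketch using the question's column names; the to_numeric conversion and the valid-range mask are my additions, not part of the original answer):

```python
import pandas as pd

# Sample rows shaped like the question's data
df = pd.DataFrame({
    "id": [100005090, 100020985, 100039605],
    "loc": ["-38.229889,-72.326819",
            "ut: -33.442101,-70.650327",
            "ut: 10.646041,-71.619039 \\N"],
})

# 1) remove extraneous characters with replace
clean = (df["loc"]
         .str.replace("ut: ", "", regex=False)
         .str.replace(" \\N", "", regex=False))

# 2) split into pieces
parts = clean.str.split(",", expand=True)
df["lat"] = pd.to_numeric(parts[0], errors="coerce")
df["lon"] = pd.to_numeric(parts[1], errors="coerce")

# 3) check that each piece is a number in the valid range
valid = df["lat"].between(-90, 90) & df["lon"].between(-180, 180)
```

Rows that fail either the numeric conversion (NaN from errors="coerce") or the range check end up False in the mask, so they can be inspected or dropped.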
Upvotes: 2
Reputation: 97281
You can use (?:...) to make a group non-capturing, so extract ignores it:
df["loc"].str.extract(r"((?:[\+-])?\d+\.\d+)\s*,\s*((?:[\+-])?\d+\.\d+)")
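For example, applied to strings like those in the question (the Series here is my own sample), the two capturing groups come back as two columns that can be cast to float; the non-capturing sign groups produce no extra columns:

```python
import pandas as pd

s = pd.Series(["ut: -33.442101,-70.650327", "4.666439,-74.071554"])

# Only the two outer groups capture, so extract returns exactly two columns
out = s.str.extract(r"((?:[\+-])?\d+\.\d+)\s*,\s*((?:[\+-])?\d+\.\d+)")
out.columns = ["lat", "lon"]
out = out.astype(float)  # signs are kept, since they sit inside the capturing groups
```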
Upvotes: 1