Dervin Thunk

Reputation: 20119

How to *extract* latitude and longitude greedily in Pandas?

I have a dataframe in Pandas like this:

        id          loc
 40     100005090   -38.229889,-72.326819   
 188    100020985   ut: -33.442101,-70.650327   
 249    10002732    ut: -33.437478,-70.614637   
 361    100039605   ut: 10.646041,-71.619039    \N
 440    100048229   4.666439,-74.071554

I need to extract the GPS points. I first filter with contains, using a regex (found here on SO, see below), to select all cells that hold a "valid" lat/long value. However, I also need to extract the numbers themselves and either put them in a series of their own (and then split on the comma) or put them in two new pandas series. I have tried the following for the extraction part:

ids_with_latlong["loc"].str.extract("[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$")

but judging by the output, the regex does not seem to be matching greedily, because I get something like this:

                 0        1    2          3    4   5    6   7        8
    40   38.229889  .229889  NaN  72.326819  NaN  72  NaN  72  .326819
    188  33.442101  .442101  NaN  70.650327  NaN  70  NaN  70  .650327

Obviously it's returning more than I want (I only need the two full coordinates, columns 0 and 3 above, next to the index), but simply dropping the extra columns feels like too much of a hack. Notice that extract also dropped the +/- signs at the beginning. If anyone has a solution, I'd really appreciate it.

Upvotes: 1

Views: 714

Answers (2)

JohnE

Reputation: 30424

@HYRY's answer looks pretty good to me. This is just an alternate approach that uses built-in pandas string methods rather than a regex. I think it's a little simpler to read, though I'm not sure it will be sufficiently general for all your cases (it works fine on this sample data).

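# strip the literal 'ut: ' prefix (regex=False treats the pattern as plain text)
df['loc'] = df['loc'].str.replace('ut: ', '', regex=False)
# split on the comma: the first piece is the latitude, the second the longitude
df['lat'] = df['loc'].apply(lambda x: x.split(',')[0])
df['lon'] = df['loc'].apply(lambda x: x.split(',')[1])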

          id                    loc         lat         lon
0  100005090  -38.229889,-72.326819  -38.229889  -72.326819
1  100020985  -33.442101,-70.650327  -33.442101  -70.650327
2   10002732  -33.437478,-70.614637  -33.437478  -70.614637
3  100039605   10.646041,-71.619039   10.646041  -71.619039
4  100048229    4.666439,-74.071554    4.666439  -74.071554

As a general suggestion for this type of approach, you might think about doing it in the following steps:

1) remove extraneous characters with replace (or maybe this is where a regex is best)

2) split into pieces

3) check that each piece is valid (all you need to do is check that it's a number, although you could take the extra step of checking that it falls within the valid range for a lat or lon)
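The three steps above could be sketched roughly as follows. This is only an illustration, not a tested general solution: the sample frame is modeled on the question's data, the cleanup regex (keep only digits, signs, dots and commas) is my assumption about what counts as "extraneous", and the `valid` column name is made up for the example.

```python
import pandas as pd

# Sample data modeled on the question; the trailing "\N" is kept as literal text.
df = pd.DataFrame({
    "id": [100005090, 100020985, 100039605],
    "loc": [
        "-38.229889,-72.326819",
        "ut: -33.442101,-70.650327",
        "ut: 10.646041,-71.619039    \\N",
    ],
})

# 1) remove extraneous characters: drop everything except digits, signs, dots, commas
cleaned = df["loc"].str.replace(r"[^0-9+\-.,]", "", regex=True)

# 2) split into pieces, one column per piece
parts = cleaned.str.split(",", expand=True)

# 3) check each piece: anything that is not a number becomes NaN,
#    then flag rows whose values fall inside the lat/lon ranges
df["lat"] = pd.to_numeric(parts[0], errors="coerce")
df["lon"] = pd.to_numeric(parts[1], errors="coerce")
df["valid"] = df["lat"].between(-90, 90) & df["lon"].between(-180, 180)

print(df[["id", "lat", "lon", "valid"]])
```

Using `errors="coerce"` means a malformed piece ends up as NaN and fails the range check instead of raising, which seems like the friendlier behavior for messy input.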

Upvotes: 2

HYRY

Reputation: 97281

You can use (?:...) to make a group non-capturing, so that extract does not return a column for it:

df["loc"].str.extract(r"((?:[\+-])?\d+\.\d+)\s*,\s*((?:[\+-])?\d+\.\d+)")
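As a rough sketch of how this might be used end to end: the frame below is sample data modeled on the question, and the `lat`/`lon` column names are my own choice. The pattern is written with `[+-]?`, which should behave the same as `(?:[\+-])?` above; the point is that only the two outer groups capture, and the sign sits inside them.

```python
import pandas as pd

# Sample data modeled on the question.
df = pd.DataFrame({
    "loc": [
        "-38.229889,-72.326819",
        "ut: -33.442101,-70.650327",
        "4.666439,-74.071554",
    ],
})

# Two capturing groups, so extract returns exactly two columns.
# The optional sign is inside each group, so it is kept in the result.
pattern = r"([+-]?\d+\.\d+)\s*,\s*([+-]?\d+\.\d+)"

coords = df["loc"].str.extract(pattern, expand=True).astype(float)
coords.columns = ["lat", "lon"]

print(coords)
```

Rows that do not match the pattern come back as NaN in both columns, so they can be filtered afterwards with `dropna`.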

Upvotes: 1
