JoshD

Reputation: 47

How to efficiently compare rows in a pandas DataFrame?

I have a pandas dataframe containing a record of lightning strikes with timestamps and global positions in the following format:

Index      Date      Time                        Lat      Lon         Good fix?
0          1         20160101  00:00:00.9962692  -7.1961  -60.7604    1
1          2         20160101  00:00:01.0646207  -7.0518  -60.6911    1
2          3         20160101  00:00:01.1102066 -25.3913  -57.2922    1
3          4         20160101  00:00:01.2018573  -7.4842  -60.5129    1
4          5         20160101  00:00:01.2942750  -7.3939  -60.4992    1
5          6         20160101  00:00:01.4431493  -9.6386  -62.8448    1
6          8         20160101  00:00:01.5226157 -23.7089  -58.8888    1
7          9         20160101  00:00:01.5932412  -6.3513  -55.6545    1
8          10        20160101  00:00:01.6736350 -23.8019  -58.9382    1
9          11        20160101  00:00:01.6957858 -24.5724  -57.7229    1

The actual dataframe contains millions of rows. I want to separate out events that happened far away in space and time from other events, and store them in a new dataframe isolated_fixes. I have written code to calculate the separation of any two events, as follows:

import math as m

def are_strikes_space_close(strike1, strike2, defclose=100, latpos=3, lonpos=4):
    # Uses the haversine formula to calculate the distance between two points,
    # returning a tuple of (Boolean closeness statement, numerical distance)
    radlat1 = m.radians(strike1[1][latpos])
    radlon1 = m.radians(strike1[1][lonpos])
    radlat2 = m.radians(strike2[1][latpos])
    radlon2 = m.radians(strike2[1][lonpos])

    a = (m.sin((radlat1 - radlat2) / 2) ** 2
         + m.cos(radlat1) * m.cos(radlat2) * m.sin((radlon1 - radlon2) / 2) ** 2)
    c = 2 * m.atan2(a ** 0.5, (1 - a) ** 0.5)
    R = 6371  # Earth radius in km
    d = R * c  # distance between points in km
    if d <= defclose:
        return (True, d)
    else:
        return (False, d)

and for time:

import datetime as dt

def getdatetime(series, timelabel=2, datelabel=1, timeformat="%X.%f", dateformat="%Y%m%d"):
    # %f accepts at most six fractional digits, so truncate the time string to 15 characters
    time = dt.datetime.strptime(series[1][timelabel][:15], timeformat)
    date = dt.datetime.strptime(str(series[1][datelabel]), dateformat)
    return dt.datetime.combine(date.date(), time.time())


def are_strikes_time_close(strike1, strike2, defclose=dt.timedelta(0, 7200, 0)):
    dt1 = getdatetime(strike1)
    dt2 = getdatetime(strike2)
    timediff = abs(dt1 - dt2)
    if timediff <= defclose:
        return (True, timediff)
    else:
        return (False, timediff)

The real problem is how to efficiently compare all events to all other events to determine how many of them are space_close and time_close.

Note that not all events need to be checked, as they are ordered by datetime, so if there were a way to check events 'middle out' and then stop once events were no longer close in time, that would save a lot of operations, but I don't know how to do this.
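For illustration, a minimal sketch of that idea (not part of the original question): if the Date and Time columns are combined into a sorted DateTime column, numpy's searchsorted can find each event's time window directly.

import numpy as np

# df must be sorted by 'DateTime'; work in int64 nanoseconds
times = df['DateTime'].astype('int64').to_numpy()
window = 7200 * 10**9  # 2 hours in ns, matching defclose above

# for row i, only rows lo[i]..hi[i]-1 can be time-close, so the
# spatial check only needs to run on that slice of the frame
lo = np.searchsorted(times, times - window, side='left')
hi = np.searchsorted(times, times + window, side='right')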

At the moment, my (nonfunctional) attempt looks like this:

def extrisolfixes(data, filtereddata, defisol=4):
    for strike1 in data.iterrows():
        near_strikes = -1  # -1 to account for self-counting once on each loop
        for strike2 in data.iterrows():
            if are_strikes_space_close(strike1, strike2)[0] and are_strikes_time_close(strike1, strike2)[0]:
                near_strikes += 1
        if near_strikes <= defisol:
            filtereddata = filtereddata.append(strike1[1])
    return filtereddata

Thanks for any help! Am happy to provide clarification if needed.

Upvotes: 6

Views: 5905

Answers (4)

user2187298

Reputation: 13

You can use some unsupervised ML algorithms to improve speed. Before using them, you need to do some data transformation. For example:

  1. Combine the "Date" and "Time" columns into a single "Timestamp" feature.
  2. You can use the raw "Lat" and "Lon" values, but it may help to merge them into a single feature. A common approach is to compute the distance from some arbitrary reference point (for example, the centre of the area); to give the geolocation more weight, you can measure the distance from more than one reference point. For the distance calculation you can use get_distance from ysearka's answer.
  3. Scale the data (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler). Try with and without it.

After this preprocessing, you can simply use one of scikit-learn's clustering algorithms (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster) to arrange your data into clusters. KMeans is a good starting point.

Also, pay attention to NearestNeighbors (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html), which searches for a given number of objects in order of similarity.
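For illustration, here is a minimal sketch of this pipeline (the column names follow the question; n_clusters and the small-cluster reading are arbitrary choices to tune, not part of the original answer):

import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans

# 1. Merge Date and Time into one numeric timestamp (ns since epoch)
df['Timestamp'] = pd.to_datetime(df['Date'].astype(str) + 'T' + df['Time'])
features = df[['Lat', 'Lon']].copy()
features['t'] = df['Timestamp'].astype('int64')

# 2. Scale so that time and space are on comparable scales
scaled = RobustScaler().fit_transform(features)

# 3. Cluster; strikes landing in very small clusters are candidates
#    for being "isolated" events
df['cluster'] = KMeans(n_clusters=50, random_state=0).fit_predict(scaled)
df['cluster_size'] = df['cluster'].map(df['cluster'].value_counts())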

Upvotes: 0

IanS

Reputation: 16241

Depending on your data, this might be useful or not. Some strikes may be "isolated" in time, i.e. further away from the strike before and the strike after than the time-threshold. You could use these strikes to separate your data into groups, and you can then process those groups using searchsorted along the lines suggested by ysearka. If your data ends up separated into hundreds of groups, it might save time.

Here is what the code would look like:

import numpy as np
import pandas as pd

# first of all, convert to timestamp
df['DateTime'] = pd.to_datetime(df['Date'].astype(str) + 'T' + df['Time'])

# calculate the time difference with the previous and following strike
df['time_separation'] = np.minimum(df['DateTime'].diff().values,
                                   -df['DateTime'].diff(-1).values)

# using a specific threshold for illustration
df['is_isolated'] = df['time_separation'] > "00:00:00.08"

# define groups of consecutive non-isolated strikes
df['group'] = (df['is_isolated'] != df['is_isolated'].shift()).cumsum()

# put isolated strikes into a separate group so they can be skipped
df.loc[df['is_isolated'], 'group'] = -1

Here is the output, with the specific threshold I used:

       Lat      Lon                      DateTime is_isolated  group
0  -7.1961 -60.7604 2016-01-01 00:00:00.996269200       False      1
1  -7.0518 -60.6911 2016-01-01 00:00:01.064620700       False      1
2 -25.3913 -57.2922 2016-01-01 00:00:01.110206600       False      1
3  -7.4842 -60.5129 2016-01-01 00:00:01.201857300        True     -1
4  -7.3939 -60.4992 2016-01-01 00:00:01.294275000        True     -1
5  -9.6386 -62.8448 2016-01-01 00:00:01.443149300       False      3
6 -23.7089 -58.8888 2016-01-01 00:00:01.522615700       False      3
7  -6.3513 -55.6545 2016-01-01 00:00:01.593241200       False      3
8 -23.8019 -58.9382 2016-01-01 00:00:01.673635000       False      3
9 -24.5724 -57.7229 2016-01-01 00:00:01.695785800       False      3
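Once the groups are defined, each one can be processed independently. A sketch (process_group is a hypothetical stand-in for the within-group pairwise check, e.g. the question's extrisolfixes or ysearka's searchsorted approach):

for label, group_df in df.groupby('group'):
    if label == -1:
        continue  # strikes already known to be isolated in time
    # the expensive pairwise space/time comparison now runs on a
    # (hopefully small) group instead of the whole frame
    process_group(group_df)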

Upvotes: 2

ysearka

Reputation: 3855

This answer might not be very efficient. I'm facing a very similar problem and am currently looking for something more efficient than what I do because it still takes one hour to compute on my dataframe (600k rows).

I first suggest you don't even think about using nested for loops like you do. You might not be able to avoid one loop (which is what I do, using apply), but the second one can (and must) be vectorized.

The idea of this technique is to create a new column in the dataframe storing whether there is another strike nearby (temporally and spatially).

First, let's create a function that calculates (with the numpy package) the distances between one strike (the reference) and all the others:

import numpy as np

def get_distance(reference, other_strikes):
    radius = 6371.00085  # radius of the earth in km
    # Get lats and longs in radians, then compute deltas:
    lat1 = np.radians(other_strikes.Lat)
    lat2 = np.radians(reference[0])
    dLat = lat2 - lat1
    dLon = np.radians(reference[1]) - np.radians(other_strikes.Lon)
    # And compute the distance (in km)
    a = np.sin(dLat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dLon / 2.0) ** 2
    return 2 * np.arcsin(np.minimum(1, np.sqrt(a))) * radius

Then create a function that will check whether, for one given strike, there is at least another nearby:

def is_there_a_strike_nearby(date_ref, lat_ref, long_ref, delta_t, delta_d, other_strikes):
    dmin = date_ref - np.timedelta64(delta_t, 'D')
    dmax = date_ref + np.timedelta64(delta_t, 'D')

    # Let's first find all strikes within the temporal range
    ind = other_strikes.Date.searchsorted([dmin, dmax])
    nearby_strikes = other_strikes.loc[ind[0]:ind[1] - 1].copy()

    if len(nearby_strikes) == 0:
        return False

    # Let's compute the spatial distance now:
    nearby_strikes['distance'] = get_distance([lat_ref, long_ref], nearby_strikes[['Lat', 'Lon']])
    nearby_strikes = nearby_strikes[nearby_strikes['distance'] <= delta_d]

    return len(nearby_strikes) > 0

Now that all your functions are ready, you can use apply on your dataframe:

data['presence of nearby strike'] = data[['Date', 'Lat', 'Lon']].apply(
    lambda x: is_there_a_strike_nearby(x['Date'], x['Lat'], x['Lon'],
                                       delta_t, delta_d, data),
    axis=1)

And that's it: you have now created a new column in your dataframe that indicates whether each strike is isolated (False) or not (True), and creating your new dataframe from this is easy.

The problem with this method is that it still takes a long time to run. There are ways to make it faster: for instance, change is_there_a_strike_nearby to take as extra arguments your data sorted by Lat and Lon, and use further searchsorted calls to filter on Lat and Lon before computing the distance (for instance, if you want the strikes within a range of 10 km, you can filter with a delta_Lat of 0.09), as sketched below.
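A hypothetical sketch of that latitude pre-filter (by_lat, lat_ref and the 0.09-degree margin are illustrative, not part of the answer's code):

# sort by Lat once, outside the per-strike check
by_lat = other_strikes.sort_values('Lat').reset_index(drop=True)

# inside the check: cut the candidate set with searchsorted before
# the more expensive haversine computation (0.09 degrees is roughly 10 km)
i0, i1 = by_lat['Lat'].searchsorted([lat_ref - 0.09, lat_ref + 0.09])
candidates = by_lat.iloc[i0:i1]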

Any feedback on this method is more than welcome!

Upvotes: 3

dodell

Reputation: 490

This is one of those problems that seems easy initially, but the more you think about it the more your head melts! We have essentially got a three-dimensional (Lat, Lon, Time) clustering problem, followed by filtering based on cluster size. There are a number of questions a little like this (though more abstract), and the responses tend to involve scipy. Check out this one. I would also check out fuzzy c-means clustering. Here is the skfuzzy example.

In your case, though, geodesic distance might be key, in which case you won't want to skip computing distances outright; the heavily mathematical examples somewhat miss the point.

If accuracy is not important, there may be more basic ways of doing it, like creating arbitrary time 'bins' using dataframe.cut or similar. There would be an optimum bin size trading speed against accuracy. For instance, if you cut into bins of t/4 (1800 seconds, for a 7200-second threshold t) and take a gap of 4 bins as being far away in time, then the actual time difference could be anywhere from 5401 to 8999 seconds. An example of cutting. Applying something similar to the Lon and Lat coordinates, and doing the calculations on the approximate values, will be faster; see the sketch below.
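For illustration, a minimal sketch of the time-binning idea (the DateTime column follows the earlier answers; the bin width and gap are the values discussed above):

import numpy as np
import pandas as pd

# 1800-second bins spanning the observation period
t = df['DateTime'].astype('int64') // 10**9  # seconds since epoch
edges = np.arange(t.min(), t.max() + 1800, 1800)
df['t_bin'] = pd.cut(t, bins=edges, labels=False, include_lowest=True)

# a bin gap greater than 4 is treated as "far away in time";
# comparing small integers is much cheaper than exact timedeltas
far_in_time = abs(df.loc[3, 't_bin'] - df.loc[8, 't_bin']) > 4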

Hope that helps.

Upvotes: 1
