Jivan

Reputation: 23098

Merge DataFrames with ordering criteria

In a previous question, I was asking how to match values from this DataFrame source:

     car_id     lat     lon
0    100        10.0    15.0
1    100        12.0    10.0
2    100        13.0    09.0
3    110        23.0    08.0
4    110        13.0    09.0
5    110        12.0    10.0
6    110        12.0    02.0
7    120        11.0    11.0
8    120        12.0    10.0
9    120        13.0    09.0
10   120        14.0    08.0
11   130        12.0    10.0

And keep only those whose coords are in this second DataFrame coords:

     lat     lon
0    12.0    10.0
1    13.0    09.0

But this time I'd like to keep only each car_id whose coordinates contain all the rows of coords, in the same order, so that the resulting DataFrame result would be:

     car_id
1    100
2    120

# 110 has all the values from coords, but not in the same order
# 130 doesn't have all the values from coords

Is there a way to achieve this result in a vectorized way, avoiding going through a lot of loops and conditionals?
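
For reference, the two frames above can be reproduced like this:

import pandas as pd

source = pd.DataFrame(
    {'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
     'lat': [10.0, 12.0, 13.0, 23.0, 13.0, 12.0, 12.0, 11.0, 12.0, 13.0, 14.0, 12.0],
     'lon': [15.0, 10.0, 9.0, 8.0, 9.0, 10.0, 2.0, 11.0, 10.0, 9.0, 8.0, 10.0]})

coords = pd.DataFrame({'lat': [12.0, 13.0], 'lon': [10.0, 9.0]})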

Upvotes: 1

Views: 75

Answers (2)

piRSquared

Reputation: 294576

plan

  • we will groupby 'car_id' and evaluate each subset
  • after an inner merge with coords we should see two things
    1. the merged rows, once deduplicated, should line up with coords exactly (same values, same order)
    2. the merge should cover every row of coords



import pandas as pd

def duper(df):
    m = df.merge(coords)
    c = pd.concat([m, coords])
    # we put the merged rows first and those are
    # the ones we'll keep after `drop_duplicates(keep='first')`
    # `keep='first'` is the default, so I don't pass it
    c1 = (c.drop_duplicates().values == coords.values).all()

    # if `keep=False` then I drop all duplicates.  If I got
    # everything in `coords` this should be empty
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.set_index('car_id').groupby(level=0).filter(duper).index.unique().values

array([100, 120])
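
To see why 110 and 130 get filtered out, duper can be checked one group at a time (assuming source and coords as defined in the question):

g = source.set_index('car_id').groupby(level=0)

duper(g.get_group(100))  # True: (12, 10) then (13, 9), same order as coords
duper(g.get_group(110))  # False: both rows of coords appear, but in reverse order
duper(g.get_group(130))  # False: (13, 9) never appears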

slight alternative

def duper(df):
    m = df.drop('car_id', axis=1).merge(coords)
    c = pd.concat([m, coords])
    c1 = (c.drop_duplicates().values == coords.values).all()
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.groupby('car_id').filter(duper).car_id.unique()

Upvotes: 1

nbraun

Reputation: 33

This isn't pretty, but what if you did something like this:

df2 = pd.DataFrame(df, copy=True)
# pair each row with the coordinates of the row that follows it
df2[['lat2', 'lon2']] = df[['lat', 'lon']].shift(-1)
df2.set_index(['lat', 'lon', 'lat2', 'lon2'], inplace=True)
print(df2.loc[(12, 10, 13, 9)].reset_index(drop=True))

   car_id
0     100
1     120

And this would be the general case:

import pandas as pd

raw_data = {'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
            'lat': [10, 12, 13, 23, 13, 12, 12, 11, 12, 13, 14, 12],
            'lon': [15, 10, 9, 8, 9, 10, 2, 11, 10, 9, 8, 10],
           }
df = pd.DataFrame(raw_data, columns=['car_id', 'lat', 'lon'])

raw_data = {
             'lat': [10, 12, 13],
             'lon': [15, 10, 9],
           }

coords = pd.DataFrame(raw_data, columns=['lat', 'lon'])

def submatch(df, match):
    df2 = pd.DataFrame(df['car_id'])
    # add the coordinates of the current row and of the next n-1 rows
    # as columns lat0/lon0, lat1/lon1, ...
    for x in range(match.shape[0]):
        df2[['lat{}'.format(x), 'lon{}'.format(x)]] = df[['lat', 'lon']].shift(-x)

    # flatten [['lat0', 'lon0'], ['lat1', 'lon1'], ...] into a single list
    n = match.shape[0]
    cols = [item for sublist in
        [['lat{}'.format(x), 'lon{}'.format(x)] for x in range(n)]
        for item in sublist]

    # index by those columns and look up the rows whose window of n
    # consecutive coordinates equals `match`
    df2.set_index(cols, inplace=True)
    return df2.loc[tuple(match.stack().values)].reset_index(drop=True)

print(submatch(df, coords))

   car_id
0     100

Upvotes: 1
