Reputation: 23098
In a previous question, I was asking how to match values from this DataFrame source:
car_id lat lon
0 100 10.0 15.0
1 100 12.0 10.0
2 100 13.0 09.0
3 110 23.0 08.0
4 110 13.0 09.0
5 110 12.0 10.0
6 110 12.0 02.0
7 120 11.0 11.0
8 120 12.0 10.0
9 120 13.0 09.0
10 120 14.0 08.0
11 130 12.0 10.0
And keep only those whose coords are in this second DataFrame coords:
lat lon
0 12.0 10.0
1 13.0 09.0
But this time I'd like to keep each car_id that gets all of coords in the same order, so that the resulting DataFrame result would be:
car_id
1 100
2 120
# 110 has all the values from coords, but not in the same order
# 130 doesn't have all the values from coords
Is there a way to achieve this result in a vectorized way, avoiding going through a lot of loops and conditionals?
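For reference, here is one way to build source and coords matching the tables above (my own reconstruction of the data shown):
import pandas as pd

source = pd.DataFrame({
    'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
    'lat': [10.0, 12.0, 13.0, 23.0, 13.0, 12.0, 12.0, 11.0, 12.0, 13.0, 14.0, 12.0],
    'lon': [15.0, 10.0, 9.0, 8.0, 9.0, 10.0, 2.0, 11.0, 10.0, 9.0, 8.0, 10.0],
})

coords = pd.DataFrame({'lat': [12.0, 13.0], 'lon': [10.0, 9.0]})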
Upvotes: 1
Views: 75
Reputation: 294576
plan: groupby 'car_id' and evaluate each subset. For each subset, do an inner merge with coords; from that merge we should see two things: the matched rows come out in the same order as coords, and every row of coords shows up.
def duper(df):
    m = df.merge(coords)
    c = pd.concat([m, coords])
    # we put the merged rows first and those are
    # the ones we'll keep after `drop_duplicates(keep='first')`
    # `keep='first'` is the default, so I don't pass it
    c1 = (c.drop_duplicates().values == coords.values).all()
    # if `keep=False` then I drop all duplicates. If I got
    # everything in `coords` this should be empty
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.set_index('car_id').groupby(level=0).filter(duper).index.unique().values
array([100, 120])
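To see why both checks are needed, here is a quick look at the intermediate frames for one group (a sketch using my own names grp, m, c, and assuming source/coords as built in the question; an inner merge preserves the order of the left frame's keys):
# car 110 has all of coords, but not in the same order
grp = source.loc[source.car_id == 110, ['lat', 'lon']]
m = grp.merge(coords)                 # rows of the group that also appear in coords
c = pd.concat([m, coords])
(c.drop_duplicates().values == coords.values).all()  # order check    -> False for 110
c.drop_duplicates(keep=False).empty                   # coverage check -> True for 110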
slight alternative, dropping car_id inside duper instead of setting it as the index:
def duper(df):
    m = df.drop(columns='car_id').merge(coords)
    c = pd.concat([m, coords])
    c1 = (c.drop_duplicates().values == coords.values).all()
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.groupby('car_id').filter(duper).car_id.unique()
Upvotes: 1
Reputation: 33
This isn't pretty, but what if you did something like this:
df2 = pd.DataFrame(df, copy=True)
# append the next row's coordinates as extra columns
df2[['lat2', 'lon2']] = df[['lat', 'lon']].shift(-1)
# then look up the two consecutive coordinate pairs as one 4-tuple
df2.set_index(['lat', 'lon', 'lat2', 'lon2'], inplace=True)
print(df2.loc[(12, 10, 13, 9)].reset_index(drop=True))
car_id
0 100
1 120
And this would be the general case:
raw_data = {'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
'lat': [10, 12, 13, 23, 13, 12, 12, 11, 12, 13, 14, 12],
'lon': [15, 10, 9, 8, 9, 10, 2, 11, 10, 9, 8, 10],
}
df = pd.DataFrame(raw_data, columns = ['car_id', 'lat', 'lon'])
raw_data = {
'lat': [10, 12, 13],
'lon': [15, 10, 9],
}
coords = pd.DataFrame(raw_data, columns = ['lat', 'lon'])
def submatch(df, match):
    df2 = pd.DataFrame(df['car_id'])
    # append the coordinates of the next rows as extra columns,
    # one pair of columns per row of `match`
    for x in range(match.shape[0]):
        df2[['lat{}'.format(x), 'lon{}'.format(x)]] = df[['lat', 'lon']].shift(-x)
    n = match.shape[0]
    cols = [item for sublist in
            [['lat{}'.format(x), 'lon{}'.format(x)] for x in range(n)]
            for item in sublist]
    df2.set_index(cols, inplace=True)
    # look up the whole window of coordinates as a single index tuple
    return df2.loc[tuple(match.stack().values)].reset_index(drop=True)
print(submatch(df, coords))
car_id
0 100
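The same helper should also handle the question's original two-row coords; coords_q below is my own name for that frame (a sketch, assuming the df built above):
coords_q = pd.DataFrame({'lat': [12, 13], 'lon': [10, 9]})
print(submatch(df, coords_q))  # expect car_id 100 and 120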
Upvotes: 1