Reputation: 11
I have two sets of tuples, and the two sets might not contain the same number of tuples. Is there a way to find matches without iterating over both sets and comparing every entry of one set to every entry of the other?
For example, I'd like to know whether the first three entries of a tuple in one set appear in any of the tuples of the other set. This is what I've tried (example code):
s1 = set()
s2 = set()
s1.add(tuple(["a", "b", "c", "e"]))
s1.add(tuple(["d", "e", "f", "h"]))
s2.add(tuple(["a", "b", "c", "d"]))
s2.add(tuple(["d", "e", "f", "g"]))
s2.add(tuple(["m", "n", "o", "p"]))
for x in s1:
    if x[0:3] in s2:
        print(x)
This won't work.
I'm asking because the sets have thousands of entries, iterating over both takes far too long, and I can't figure out a smart way to do it.
Edit for clarification: Each tuple always has the same number of entries, in my case 4. I'd like to check any arbitrary slice, such as [0:2], [1:3] or [0:x]. I need confirmation that, for example, [0:3] of one tuple is the same as [0:3] of another one.
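To make the requirement concrete, here is a minimal sketch of one way to avoid the nested loop, assuming every tuple has the same length: slice each tuple of s2 once, keep the slices in a set, then test the corresponding slice of each tuple in s1 against that set. The helper name `matching_on_slice` is made up for illustration.

```python
def matching_on_slice(s1, s2, start, stop):
    """Return the tuples of s1 whose [start:stop] slice equals
    the same slice of at least one tuple in s2."""
    # One pass over s2: collect every slice for fast membership tests.
    slices = {t[start:stop] for t in s2}
    # One pass over s1: keep the tuples whose slice was seen in s2.
    return {t for t in s1 if t[start:stop] in slices}


s1 = {("a", "b", "c", "e"), ("d", "e", "f", "h")}
s2 = {("a", "b", "c", "d"), ("d", "e", "f", "g"), ("m", "n", "o", "p")}

print(matching_on_slice(s1, s2, 0, 3))  # both tuples of s1 agree with s2 on [0:3]
print(matching_on_slice(s1, s2, 1, 4))  # empty set: nothing agrees on [1:4]
```

Each set is traversed once per slice, so the work grows with len(s1) + len(s2) rather than with their product.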
Upvotes: 1
Views: 301
Reputation: 2484
You could use pandas for this. Its column-wise operations run in compiled code, so they are usually much faster than looping in Python.
df1
   0  1  2  3
0  a  b  c  e
1  d  e  f  h
2  x  e  f  h
3  y  u  d  h
df2
   0  1  2  3     4     5
0  a  b  c  d  None  None
1  d  e  f  g  None  None
2  m  n  o  p  None  None
3  b  c  d  y     u     d
I've assumed above that the tuples may be of varying length.
import pandas as pd
# tuple literals are used here; tuple("a", "b", ...) raises a TypeError
df1 = pd.DataFrame([("a", "b", "c", "e"), ("d", "e", "f", "h"), ("x", "e", "f", "h"), ("y", "u", "d", "h")])
df2 = pd.DataFrame([("a", "b", "c", "d"), ("d", "e", "f", "g"), ("m", "n", "o", "p"), ("b", "c", "d", "y", "u", "d")])
# for each row, checks whether any of the first 3 columns equals the value at the same row and column of df2
df1[[0, 1, 2]].isin(df2).any(axis=1)
0 True
1 True
2 False
3 True
dtype: bool
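Note that isin with a DataFrame argument aligns on index and column labels, so each value is compared only with the value at the same row and column of df2, and any(axis=1) flags a row as soon as one of the three columns matches. Rows 0 and 1 match on all of a, b, c and d, e, f; row 3 is flagged only because of the d in column 2, not because y, u, d appears elsewhere in df2. You could generalise this approach to look at column subsets of the first dataframe other than columns 0 to 2. This could look something like this: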
# start at 1 so the column list is never empty, and include the full width
for col_max in range(1, len(df1.columns) + 1):
    col_names = list(range(col_max))
    print(df1[col_names].isin(df2).any(axis=1))
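If the goal is an exact match of all chosen columns against any row of the second frame (rather than a position-wise comparison), a left merge with an indicator column is one option. This is only a sketch under assumptions: df2 is cut down to four columns for simplicity, and the names `cols`, `keys`, and `row_matches` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([("a", "b", "c", "e"), ("d", "e", "f", "h"),
                    ("x", "e", "f", "h"), ("y", "u", "d", "h")])
df2 = pd.DataFrame([("a", "b", "c", "d"), ("d", "e", "f", "g"),
                    ("m", "n", "o", "p"), ("b", "c", "d", "y")])

cols = [0, 1, 2]  # columns that must all agree

# Unique column combinations from df2, so the merge cannot duplicate df1 rows.
keys = df2[cols].drop_duplicates()

# The left merge keeps every row of df1; "_merge" records whether a partner was found.
merged = df1.merge(keys, on=cols, how="left", indicator=True)
row_matches = merged["_merge"] == "both"

print(row_matches.tolist())
```

Here a row of df1 counts as a match only when all of its values in `cols` appear together in some row of df2, regardless of that row's position.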
Upvotes: 1