Reputation: 63
I was working on a recommendation system (RS) in Python when I came across a serious problem: I couldn't access the elements of a set without changing their order.
For example, once I convert a set to a list, the order changes. (In a recommendation system, order is very important.)
final_prediction=set(df_final)-set(df1)
For example:
>>> df_final=['a','x','z','p','s','j','b']
>>> df1=['b','j']
>>> set(df_final)-set(df1)
{'p', 'a', 's', 'z', 'x'}
Here df_final and df1 are both categorical variables.
Although I ended up using another approach, I had to rack my brain to change the code, because the set-based version was giving perfect results and everything else was working fine. I was in the final phase of my RS, but because of the set ordering I had to take a different approach.
How do we access a set without changing the order?
Upvotes: 5
Views: 18268
Reputation: 123423
Since you need ordered sets, I recommend using the ActiveState recipe that the Python documentation recommends in the "See also:" section at the very end.
If you put the recipe's code in a separate file named orderedset.py, it can be imported as a module and used like this:
from orderedset import OrderedSet # See https://code.activestate.com/recipes/576694
df_final = ['a','x','z','p','s','j','b']
df1 = ['b','j']
print(OrderedSet(df_final) - OrderedSet(df1)) # -> OrderedSet(['a', 'x', 'z', 'p', 's'])
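If adding a separate module is not an option, a rough equivalent can be sketched with a plain dict, assuming Python 3.7 or later, where dicts are guaranteed to preserve insertion order. The helper name ordered_difference is made up for this illustration and is not part of the recipe:

```python
def ordered_difference(items, excluded):
    """Return the unique items, in first-seen order, that are not in excluded."""
    excluded_set = set(excluded)  # set gives fast membership tests
    # dict.fromkeys keeps one key per distinct item, in insertion order
    # (guaranteed from Python 3.7 onward).
    return [item for item in dict.fromkeys(items) if item not in excluded_set]


df_final = ['a', 'x', 'z', 'p', 's', 'j', 'b']
df1 = ['b', 'j']
result = ordered_difference(df_final, df1)
print(result)  # ['a', 'x', 'z', 'p', 's']
```

Like a set difference, this drops duplicates from df_final, but it keeps the order in which the items first appeared.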
Upvotes: 1
Reputation: 25023
Here are the two lists; the first one is ordered:
>>> df_final=['a','x','z','p','s','j','b']
>>> df1=['b','j']
This works, but it's O(N×M):
>>> [cat_var for cat_var in df_final if cat_var not in df1]
['a', 'x', 'z', 'p', 's']
This has a setup cost, but it's O(N), which matters if both lists are long:
>>> sdf1 = set(df1)
>>> [cat_var for cat_var in df_final if cat_var not in sdf1]
['a', 'x', 'z', 'p', 's']
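To see the difference between the two variants, here is a small timing sketch using timeit. The list sizes and the names big, exclude, with_list and with_set are arbitrary choices for illustration, and the actual timings will vary by machine:

```python
import timeit

# Hypothetical test data: 2000 items, half of which are to be excluded.
big = list(range(2000))
exclude = list(range(0, 2000, 2))


def with_list():
    # Each "not in" scans the exclude list: O(N×M) overall.
    return [x for x in big if x not in exclude]


def with_set():
    # Building the set costs O(M); each lookup is then O(1) on average.
    exclude_set = set(exclude)
    return [x for x in big if x not in exclude_set]


# Both variants produce the same ordered result.
assert with_list() == with_set()

print("list lookup:", timeit.timeit(with_list, number=3))
print("set lookup: ", timeit.timeit(with_set, number=3))
```

For lists as short as the ones in the question, the difference is negligible; it only starts to matter as both lists grow.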
Upvotes: 0
Reputation: 164623
set is an unordered collection. For an ordered collection, you can use list or tuple. You now have a few options. Your choice should depend on whether you expect repeats in df_final. If you have no repeats, you can use a list comprehension:
df1_set = set(df1)
res1 = [i for i in df_final if i not in df1_set]
# ['a', 'x', 'z', 'p', 's']
If you have repeats in df_final, then you need unique items with ordering maintained. For this, you can use toolz.unique, which is equivalent to the unique_everseen recipe found in the itertools docs:
from toolz import unique
res2 = [i for i in unique(df_final) if i not in df1_set]
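If you would rather not install toolz, a minimal version of the same idea can be written by hand. This is a simplified sketch of unique_everseen (without the optional key argument that the docs recipe supports), and the df_final below has repeats added purely for illustration:

```python
def unique_everseen(iterable):
    """Yield each element the first time it is seen, preserving order."""
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item


# Example data with repeats ('a' and 'x' appear twice).
df_final = ['a', 'x', 'a', 'z', 'p', 'x', 's', 'j', 'b']
df1 = ['b', 'j']

df1_set = set(df1)
res2 = [i for i in unique_everseen(df_final) if i not in df1_set]
print(res2)  # ['a', 'x', 'z', 'p', 's']
```

Note that this requires the items to be hashable, which holds for the strings used here.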
Upvotes: 5