Reputation: 862
I need to build three dictionaries from the same csv file. The input file looks like this:
col1 col2 value
item1 a value1
item1 b value2
item1 c value3
item2 a value4
item2 c value5
...
And I need these three dictionaries:
1.
dict1
item1: {a, b, c}
item2: {a, c}
...
2.
dict2
(item1, a): value1
(item1, b): value2
(item1, c): value3
(item2, a): value4
(item2, c): value5
I need to use sets as values in the first dictionary because I will later have to compute intersections between those values, and a set seems the most suitable type for that. (The keys of dict2 are tuples, since sets are not hashable.)
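For reference, this is a rough sketch of how I imagine the first two dictionaries could be built straight from the file (the delimiter and the inlined sample data are just placeholders for my real file):

```python
import csv
import io
from collections import defaultdict

# stand-in for the real file; in practice this would be open('input.csv')
data = io.StringIO("col1 col2 value\n"
                   "item1 a value1\n"
                   "item1 b value2\n"
                   "item1 c value3\n"
                   "item2 a value4\n"
                   "item2 c value5\n")

dict1 = defaultdict(set)  # item -> set of its col2 entries
dict2 = {}                # (col1, col2) -> value; keys are tuples, not sets,
                          # because dict keys must be hashable

for row in csv.DictReader(data, delimiter=' '):
    dict1[row['col1']].add(row['col2'])
    dict2[(row['col1'], row['col2'])] = row['value']

# dict1 == {'item1': {'a', 'b', 'c'}, 'item2': {'a', 'c'}}
# dict2[('item2', 'c')] == 'value5'
```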
My final dictionary, resulting from these intersections, will be something like:
3.
dict3
(item1,item2):value1+value3
It is probably easier to understand just by looking at the examples, but let me explain: dict3 considers the pairwise intersections between the values of dict1 (which in my example yields only a), then computes dict2.get((item1, a)) + dict2.get((item2, a)) and assigns the result as the value for the pair (item1, item2). If item1 and item2 had another element in common, say d, the value for (item1, item2) would instead be dict2.get((item1, a)) + dict2.get((item2, a)) + dict2.get((item1, d)) + dict2.get((item2, d)). Please note that in the real dataset the col1 and col2 items are strings.
This calculation is repeated checking every pairwise intersection of values in dict1.
What's the easiest way to get these dictionaries? I am more comfortable with pandas, so I'd prefer solutions using a dataframe, but I can accept anything that reads directly from the external file, since that only comes into play at the very first stage.
EDIT: I should clarify that I need pairwise intersections, an issue that doesn't arise with the example I gave above. For a better example to work on, try:
df=pd.DataFrame(columns=['col1','col2','value'])
df.col1=['item1','item1','item1','item2','item2','item3','item3']
df.col2=['a','b','c','a','d','a','c']
df.value=[1,2,3,4,5,6,7]
and try to get as a result:
dict3
(item1,item2):5
(item1,item3):17
(item2,item3):10
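Just to pin down the semantics, this is the brute-force computation I have in mind, in plain Python (itertools.combinations over the item pairs is just one way to spell it out):

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'col1': ['item1', 'item1', 'item1', 'item2', 'item2', 'item3', 'item3'],
                   'col2': ['a', 'b', 'c', 'a', 'd', 'a', 'c'],
                   'value': [1, 2, 3, 4, 5, 6, 7]})

# dict1: item -> set of its col2 entries; dict2: (col1, col2) tuple -> value
dict1 = {item: set(grp) for item, grp in df.groupby('col1')['col2']}
dict2 = dict(zip(zip(df['col1'], df['col2']), df['value']))

# dict3: for every pair of items, sum the values of all shared col2 entries
dict3 = {}
for i1, i2 in combinations(sorted(dict1), 2):
    common = dict1[i1] & dict1[i2]
    if common:
        dict3[(i1, i2)] = sum(dict2[(i1, c)] + dict2[(i2, c)] for c in common)

# dict3 == {('item1', 'item2'): 5, ('item1', 'item3'): 17, ('item2', 'item3'): 10}
```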
It seems like a very complex problem: I found something on pairwise set intersection here but I can't find a final solution.
Upvotes: 0
Views: 274
Reputation: 6113
The following works if you don't care about how many items match up (see below for how to do this with only pairwise matching):
In [1]: df
Out[1]:
col1 col2 value
0 item1 a 1
1 item1 b 2
2 item1 c 4
3 item2 a 8
4 item2 d 16
In [2]: dict1 = df.groupby('col1').apply(lambda vals: set(vals.col2))
In [3]: dict1
Out[3]:
col1
item1 {b, c, a}
item2 {a, d}
dtype: object
In [4]: dict3 = {tuple(sorted(vals.col1)): sum(vals.value)
for kind, vals in df.groupby('col2')
if kind in set.intersection(*[dict1[itm] for itm in vals.col1])
and len(vals) > 1}
In [5]: dict3
Out[5]: {('item1', 'item2'): 9}
EDIT:
To do this in just a pairwise fashion, let's use a modified join:
df.reset_index(inplace=True)
merged = pd.merge(df, df, on='col2')
merged = merged[merged.index_x < merged.index_y]
idxs, vals = zip(*[(tuple(sorted([row.col1_x, row.col1_y])), row.value_x + row.value_y)
for row in merged.itertuples(False)])
final = pd.DataFrame(list(vals), index=list(idxs))
Our values are:
In [1]: df
Out[1]:
index col1 col2 value
0 0 item1 a 1
1 1 item1 b 2
2 2 item1 c 3
3 3 item2 a 4
4 4 item2 d 5
5 5 item3 a 6
6 6 item3 c 7
In [2]: merged
Out[2]:
index_x col1_x col2 value_x index_y col1_y value_y
1 0 item1 a 1 3 item2 4
2 0 item1 a 1 5 item3 6
5 3 item2 a 4 5 item3 6
11 2 item1 c 3 6 item3 7
In [3]: final
Out[3]:
0
(item1, item2) 5
(item1, item3) 7
(item2, item3) 10
(item1, item3) 10
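If, as in the question's expected output, the duplicated (item1, item3) rows should be added together rather than kept separate, one option is to aggregate the merged frame by item pair. A sketch, redoing the merge above (note this relies on the input being sorted by col1 so that col1_x < col1_y already holds; with unsorted data you'd want to sort each pair first):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['item1', 'item1', 'item1', 'item2', 'item2', 'item3', 'item3'],
                   'col2': ['a', 'b', 'c', 'a', 'd', 'a', 'c'],
                   'value': [1, 2, 3, 4, 5, 6, 7]})
df.reset_index(inplace=True)

# self-join on col2, keeping each unordered row pair once
merged = pd.merge(df, df, on='col2')
merged = merged[merged.index_x < merged.index_y].copy()

# one row per shared col2 element; summing per item pair collapses duplicates
merged['pair_sum'] = merged.value_x + merged.value_y
summed = merged.groupby(['col1_x', 'col1_y'])['pair_sum'].sum()

# summed[('item1', 'item3')] == 17
```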
The merge operation might soak up a fair bit of memory, depending on your data size. Pandas doesn't support a more complex merge predicate that would take our inequality into account as well, and as far as I know there's no memory-efficient one-step way to do this without coding the join yourself (which you could do with a dictionary comprehension; see below). Really, though, if data size is that much of an issue, you might want to consider Spark dataframes instead, which should scale much better (I don't know whether that necessarily holds on a single machine, but I suspect it does) without requiring dramatically different code.
To do this with a dictionary comprehension, I think your code could look something like the following:
final_dict = {
    tuple(sorted((a.col1, b.col1))): a.value + b.value
    for a_index, a in df.iterrows()
    for b_index, b in df.iterrows()
    if a_index < b_index
    and a.col2 == b.col2
}
# {('item1', 'item2'): 5, ('item1', 'item3'): 10, ('item2', 'item3'): 10}
Note how the dictionary silently overwrites duplicate key entries, whereas Pandas retains them. That's an ambiguity you'll have to decide how to resolve.
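If you want the duplicates summed instead (matching the 17 in the question's expected output), a small variant of the comprehension that accumulates into a defaultdict would work; a sketch:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'col1': ['item1', 'item1', 'item1', 'item2', 'item2', 'item3', 'item3'],
                   'col2': ['a', 'b', 'c', 'a', 'd', 'a', 'c'],
                   'value': [1, 2, 3, 4, 5, 6, 7]})

# accumulate instead of overwrite: every shared col2 element contributes
final_dict = defaultdict(int)
for a_index, a in df.iterrows():
    for b_index, b in df.iterrows():
        if a_index < b_index and a.col2 == b.col2 and a.col1 != b.col1:
            final_dict[tuple(sorted((a.col1, b.col1)))] += a.value + b.value

# dict(final_dict) == {('item1', 'item2'): 5, ('item1', 'item3'): 17, ('item2', 'item3'): 10}
```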
Upvotes: 3