Nikita Maksimov
Nikita Maksimov

Reputation: 25

How to find frequently together items in a python list of dictionaries

I am looking for a clean solution to the following problem, if anyone has any ideas please share, will be thankful for any ideas!

I have a list of dictionaries:

[
    {
       'project': 'Project A',
       'feature': ['Feature A', 'Feature B', 'Feature C']
    },
    {
        'project': 'Project B',
        'feature': ['Feature B', 'Feature C']
    },
    {
        'project': 'Project C',
        'feature': ['Feature D', 'Feature A']
    },
    {
        'project': 'Project D',
        'feature': ['Feature A', 'Feature B', 'Feature C']
    }
]

I am looking for features that are most commonly used together. So the output should be:

[
    {
    'features': ['Feature B', 'Feature C'],
    'together': 3
    },
    {
    'features': ['Feature A', 'Feature B', 'Feature C'],
    'together': 2
    },
    {
    'features': ['Feature A', 'Feature B'],
    'together': 2
    },
]

Some important features:

  1. The feature list could be any length.
  2. The longer list should have priority over lists with same 'together' value.

Will be very thankful for any ideas!

Upvotes: 0

Views: 81

Answers (4)

keithpjolley
keithpjolley

Reputation: 2263

There's a boatload of data to be found in information like this. If you think of the features as "nodes" then the projects "link" (edge) them. This is a network. For a small network it's easy enough to do by hand but when you have millions or billions then things really get interesting. Of course there's a module for that - networkx (and scipy).

Anyways, here's an example of using your data:

import networkx as nx
from itertools import combinations
pf = [
        {
           'project': 'Project A',
           'feature': ['Feature A', 'Feature B', 'Feature C']
        },
        {
            'project': 'Project B',
            'feature': ['Feature B', 'Feature C']
        },
        {
            'project': 'Project C',
            'feature': ['Feature D', 'Feature A']
        },
        {
            'project': 'Project D',
            'feature': ['Feature A', 'Feature B', 'Feature C']
        }
    ]
G = nx.MultiGraph()
[G.add_edges_from(combinations(p['feature'], r=2), label=f"{p['project']}") for p in pf]
nx.nx_agraph.write_dot(G, '/tmp/project.dot')
# Convert to picture w/ 
# dot /tmp/project.dot -T png > /tmp/project.png && open /tmp/project.png

This produces the following picture: enter image description here

You can see from the graph that "Feature A," "Feature B," and "Feature C" all have five links while "Feature D" only has one.

nx.degree(G)
MultiDegreeView({'Feature A': 5, 'Feature B': 5, 'Feature C': 5, 'Feature D': 1})

Your question was to see about the connectedness of the the features:

There are a number of ways of answering that:

nx.all_pairs_node_connectivity(G)
 
{'Feature A': {'Feature B': 2, 'Feature C': 2, 'Feature D': 1},
 'Feature B': {'Feature A': 2, 'Feature C': 2, 'Feature D': 1},
 'Feature C': {'Feature A': 2, 'Feature B': 2, 'Feature D': 1},
 'Feature D': {'Feature A': 1, 'Feature B': 1, 'Feature C': 1}}

However, not all links are created equal. This was the brilliant insight that made billionaires of many googlers. In a small graph like this pagerank isn't all that useful but as you can kind of see, "Feature A" is the most important node because it connects between all the features, linking "Feature D" to the others. pagerank quantifies that feature:

nx.pagerank(G)

{'Feature A': 0.3164650028728918,
 'Feature B': 0.2961180914118473,
 'Feature C': 0.2961180914118473,
 'Feature D': 0.09129881430341397}

If you end up with a bunch of data there's a lot to explore and networkx can help you make sense of your data. You may also want to drop the "Project X" and use a nx.Graph(). Then, instead of having two unique links between "Feature A" and "Feature B" you'd have one link of weight=2. Some of the algorithms in networkx, et. al., will work better this way, if the project isn't important.

Hopefully gives some ideas into how to gain insights with data like this.

Anyways, to finally answer the question directly. Don't look too closely at this:

from itertools import combinations
features = [i for s in [[", ".join(a) for a in s] for s in [[a for a in combinations(p['feature'], r=r+1)] for p in pf for r in range(len(p['feature']))]] for i in s]
print({f:features.count(f) for f in features})

{'Feature A': 3,
 'Feature B': 3,
 'Feature C': 3,
 'Feature A, Feature B': 2,
 'Feature A, Feature C': 2,
 'Feature B, Feature C': 3,
 'Feature A, Feature B, Feature C': 2,
 'Feature D': 1,
 'Feature D, Feature A': 1}

Upvotes: 0

pakpe
pakpe

Reputation: 5479

Here is another solution:

import itertools as it

lst = [
    {
       'project': 'Project A',
       'feature': ['Feature A', 'Feature B', 'Feature C']
    },
    {
        'project': 'Project B',
        'feature': ['Feature B', 'Feature C']
    },
    {
        'project': 'Project C',
        'feature': ['Feature D', 'Feature A']
    },
    {
        'project': 'Project D',
        'feature': ['Feature A', 'Feature B', 'Feature C']
    }
]
#create a set of features
features = []
for dic in lst:
    features += (dic["feature"])
features = set(features)

#create a list of all combinations of features >= length 2
combos = []
for i in range(2, len(features)+1):
    combos += (list(it.combinations(features,i)))

#iterate through the combinations, count their occurence and add to a list of dictionaries
result = []
for tup in combos:
    count = 0
    for dic in lst:
        if set(tup).issubset(dic["feature"]):
            count +=1
    if count > 0:
        result.append({"features": list(tup), "together": count})

print(result)
#[{'features': ['Feature C', 'Feature B'], 'together': 3}, 
{'features': ['Feature C', 'Feature A'], 'together': 2}, 
{'features': ['Feature B', 'Feature A'], 'together': 2}, 
{'features': ['Feature D', 'Feature A'], 'together': 1}, 
{'features': ['Feature C', 'Feature B', 'Feature A'], 'together': 2}]

Upvotes: 1

Suneesh Jacob
Suneesh Jacob

Reputation: 806

Code:

raw_list = [
    {
       'project': 'Project A',
       'feature': ['Feature A', 'Feature B', 'Feature C']
    },
    {
        'project': 'Project B',
        'feature': ['Feature B', 'Feature C']
    },
    {
        'project': 'Project C',
        'feature': ['Feature D', 'Feature A']
    },
    {
        'project': 'Project D',
        'feature': ['Feature A', 'Feature B', 'Feature C']
    }
]

features = [tuple(i['feature']) for i in raw_list]
unique_feature_list = list(set(features))

unique_features = set({})
for i in unique_feature_list:
    for j in range(len(i)):
        for k in itertools.combinations(i,j+1):
            if k not in unique_features:
                unique_features.add(k)




structured_list = []
for i in unique_features:
    num = sum([1 if len(set(i)-set(j['feature']))==0 else 0 for j in raw_list])
    structured_list.append({'features':list(i), 'together':num})
    
print(structured_list)

Output would be something like:

[{'features': ['Feature C'], 'together': 3},
 {'features': ['Feature B'], 'together': 3},
 {'features': ['Feature D', 'Feature A'], 'together': 1},
 {'features': ['Feature B', 'Feature C'], 'together': 3},
 {'features': ['Feature D'], 'together': 1},
 {'features': ['Feature A'], 'together': 3},
 {'features': ['Feature A', 'Feature C'], 'together': 2},
 {'features': ['Feature A', 'Feature B'], 'together': 2},
 {'features': ['Feature A', 'Feature B', 'Feature C'], 'together': 2}]

Upvotes: 0

Game Developement
Game Developement

Reputation: 409

It isn't a completed code by far, but enough to get you started on how the algorithm would look:

import itertools
a = 'abcd' ## all possible features
b = ['abc','bc','da','abc'] # list of features for each project: A,B,C,D (see OP list of dicts)
ds = []
es = []
for i in range(2,len(a)+1):
    d = list(itertools.permutations(a,i))
    e = [0 for e in d]
    for c in range(len(d)):
        for be in b:
            if ''.join(d[c]) in be:
                e[c] += 1
    es.append(e)
    ds.append(d)

for i in range(len(es)):
    for j in range(len(es[i])):
        print(ds[i][j],es[i][j])

Output (occurences of each group):

('a', 'b') 2
('a', 'c') 0
('a', 'd') 0
('b', 'a') 0
('b', 'c') 3
('b', 'd') 0
('c', 'a') 0
('c', 'b') 0
('c', 'd') 0
('d', 'a') 1
('d', 'b') 0
('d', 'c') 0
('a', 'b', 'c') 2
('a', 'b', 'd') 0
('a', 'c', 'b') 0
('a', 'c', 'd') 0
('a', 'd', 'b') 0
('a', 'd', 'c') 0
('b', 'a', 'c') 0
('b', 'a', 'd') 0
('b', 'c', 'a') 0
('b', 'c', 'd') 0
('b', 'd', 'a') 0
('b', 'd', 'c') 0
('c', 'a', 'b') 0
('c', 'a', 'd') 0
('c', 'b', 'a') 0
('c', 'b', 'd') 0
('c', 'd', 'a') 0
('c', 'd', 'b') 0
('d', 'a', 'b') 0
('d', 'a', 'c') 0
('d', 'b', 'a') 0
('d', 'b', 'c') 0
('d', 'c', 'a') 0
('d', 'c', 'b') 0
('a', 'b', 'c', 'd') 0
('a', 'b', 'd', 'c') 0
('a', 'c', 'b', 'd') 0
('a', 'c', 'd', 'b') 0
('a', 'd', 'b', 'c') 0
('a', 'd', 'c', 'b') 0
('b', 'a', 'c', 'd') 0
('b', 'a', 'd', 'c') 0
('b', 'c', 'a', 'd') 0
('b', 'c', 'd', 'a') 0
('b', 'd', 'a', 'c') 0
('b', 'd', 'c', 'a') 0
('c', 'a', 'b', 'd') 0
('c', 'a', 'd', 'b') 0
('c', 'b', 'a', 'd') 0
('c', 'b', 'd', 'a') 0
('c', 'd', 'a', 'b') 0
('c', 'd', 'b', 'a') 0
('d', 'a', 'b', 'c') 0
('d', 'a', 'c', 'b') 0
('d', 'b', 'a', 'c') 0
('d', 'b', 'c', 'a') 0
('d', 'c', 'a', 'b') 0
('d', 'c', 'b', 'a') 0

Change it to itertools.combinations if need be. Change to for i in range(2,len(a)) if you don't want groups of 4

Hope it helped!

Upvotes: 2

Related Questions