Reputation: 25
I am looking for a clean solution to the following problem, if anyone has any ideas please share, will be thankful for any ideas!
I have a list of dictionaries:
[
{
'project': 'Project A',
'feature': ['Feature A', 'Feature B', 'Feature C']
},
{
'project': 'Project B',
'feature': ['Feature B', 'Feature C']
},
{
'project': 'Project C',
'feature': ['Feature D', 'Feature A']
},
{
'project': 'Project D',
'feature': ['Feature A', 'Feature B', 'Feature C']
}
]
I am looking for features that are most commonly used together. So the output should be:
[
{
'features': ['Feature B', 'Feature C'],
'together': 3
},
{
'features': ['Feature A', 'Feature B', 'Feature C'],
'together': 2
},
{
'features': ['Feature A', 'Feature B'],
'together': 2
},
]
Some important features:
Will be very thankful for any ideas!
Upvotes: 0
Views: 81
Reputation: 2263
There's a boatload of data to be found in information like this. If you think of the features as "nodes" then the projects "link" (edge) them. This is a network. For a small network it's easy enough to do by hand but when you have millions or billions then things really get interesting. Of course there's a module for that - networkx (and scipy).
Anyways, here's an example of using your data:
import networkx as nx
from itertools import combinations
pf = [
{
'project': 'Project A',
'feature': ['Feature A', 'Feature B', 'Feature C']
},
{
'project': 'Project B',
'feature': ['Feature B', 'Feature C']
},
{
'project': 'Project C',
'feature': ['Feature D', 'Feature A']
},
{
'project': 'Project D',
'feature': ['Feature A', 'Feature B', 'Feature C']
}
]
G = nx.MultiGraph()
[G.add_edges_from(combinations(p['feature'], r=2), label=f"{p['project']}") for p in pf]
nx.nx_agraph.write_dot(G, '/tmp/project.dot')
# Convert to picture w/
# dot /tmp/project.dot -T png > /tmp/project.png && open /tmp/project.png
This produces the following picture:
You can see from the graph that "Feature A," "Feature B," and "Feature C" all have five links while "Feature D" only has one.
nx.degree(G)
MultiDegreeView({'Feature A': 5, 'Feature B': 5, 'Feature C': 5, 'Feature D': 1})
Your question was to see about the connectedness of the the features:
There are a number of ways of answering that:
nx.all_pairs_node_connectivity(G)
{'Feature A': {'Feature B': 2, 'Feature C': 2, 'Feature D': 1},
'Feature B': {'Feature A': 2, 'Feature C': 2, 'Feature D': 1},
'Feature C': {'Feature A': 2, 'Feature B': 2, 'Feature D': 1},
'Feature D': {'Feature A': 1, 'Feature B': 1, 'Feature C': 1}}
However, not all links are created equal. This was the brilliant insight that made billionaires of many googlers. In a small graph like this pagerank isn't all that useful but as you can kind of see, "Feature A" is the most important node because it connects between all the features, linking "Feature D" to the others. pagerank
quantifies that feature:
nx.pagerank(G)
{'Feature A': 0.3164650028728918,
'Feature B': 0.2961180914118473,
'Feature C': 0.2961180914118473,
'Feature D': 0.09129881430341397}
If you end up with a bunch of data there's a lot to explore and networkx can help you make sense of your data. You may also want to drop the "Project X" and use a nx.Graph()
. Then, instead of having two unique links between "Feature A" and "Feature B" you'd have one link of weight=2
. Some of the algorithms in networkx, et. al., will work better this way, if the project isn't important.
Hopefully gives some ideas into how to gain insights with data like this.
Anyways, to finally answer the question directly. Don't look too closely at this:
from itertools import combinations
features = [i for s in [[", ".join(a) for a in s] for s in [[a for a in combinations(p['feature'], r=r+1)] for p in pf for r in range(len(p['feature']))]] for i in s]
print({f:features.count(f) for f in features})
{'Feature A': 3,
'Feature B': 3,
'Feature C': 3,
'Feature A, Feature B': 2,
'Feature A, Feature C': 2,
'Feature B, Feature C': 3,
'Feature A, Feature B, Feature C': 2,
'Feature D': 1,
'Feature D, Feature A': 1}
Upvotes: 0
Reputation: 5479
Here is another solution:
import itertools as it
lst = [
{
'project': 'Project A',
'feature': ['Feature A', 'Feature B', 'Feature C']
},
{
'project': 'Project B',
'feature': ['Feature B', 'Feature C']
},
{
'project': 'Project C',
'feature': ['Feature D', 'Feature A']
},
{
'project': 'Project D',
'feature': ['Feature A', 'Feature B', 'Feature C']
}
]
#create a set of features
features = []
for dic in lst:
features += (dic["feature"])
features = set(features)
#create a list of all combinations of features >= length 2
combos = []
for i in range(2, len(features)+1):
combos += (list(it.combinations(features,i)))
#iterate through the combinations, count their occurence and add to a list of dictionaries
result = []
for tup in combos:
count = 0
for dic in lst:
if set(tup).issubset(dic["feature"]):
count +=1
if count > 0:
result.append({"features": list(tup), "together": count})
print(result)
#[{'features': ['Feature C', 'Feature B'], 'together': 3},
{'features': ['Feature C', 'Feature A'], 'together': 2},
{'features': ['Feature B', 'Feature A'], 'together': 2},
{'features': ['Feature D', 'Feature A'], 'together': 1},
{'features': ['Feature C', 'Feature B', 'Feature A'], 'together': 2}]
Upvotes: 1
Reputation: 806
Code:
raw_list = [
{
'project': 'Project A',
'feature': ['Feature A', 'Feature B', 'Feature C']
},
{
'project': 'Project B',
'feature': ['Feature B', 'Feature C']
},
{
'project': 'Project C',
'feature': ['Feature D', 'Feature A']
},
{
'project': 'Project D',
'feature': ['Feature A', 'Feature B', 'Feature C']
}
]
features = [tuple(i['feature']) for i in raw_list]
unique_feature_list = list(set(features))
unique_features = set({})
for i in unique_feature_list:
for j in range(len(i)):
for k in itertools.combinations(i,j+1):
if k not in unique_features:
unique_features.add(k)
structured_list = []
for i in unique_features:
num = sum([1 if len(set(i)-set(j['feature']))==0 else 0 for j in raw_list])
structured_list.append({'features':list(i), 'together':num})
print(structured_list)
Output would be something like:
[{'features': ['Feature C'], 'together': 3},
{'features': ['Feature B'], 'together': 3},
{'features': ['Feature D', 'Feature A'], 'together': 1},
{'features': ['Feature B', 'Feature C'], 'together': 3},
{'features': ['Feature D'], 'together': 1},
{'features': ['Feature A'], 'together': 3},
{'features': ['Feature A', 'Feature C'], 'together': 2},
{'features': ['Feature A', 'Feature B'], 'together': 2},
{'features': ['Feature A', 'Feature B', 'Feature C'], 'together': 2}]
Upvotes: 0
Reputation: 409
It isn't a completed code by far, but enough to get you started on how the algorithm would look:
import itertools
a = 'abcd' ## all possible features
b = ['abc','bc','da','abc'] # list of features for each project: A,B,C,D (see OP list of dicts)
ds = []
es = []
for i in range(2,len(a)+1):
d = list(itertools.permutations(a,i))
e = [0 for e in d]
for c in range(len(d)):
for be in b:
if ''.join(d[c]) in be:
e[c] += 1
es.append(e)
ds.append(d)
for i in range(len(es)):
for j in range(len(es[i])):
print(ds[i][j],es[i][j])
Output (occurences of each group):
('a', 'b') 2
('a', 'c') 0
('a', 'd') 0
('b', 'a') 0
('b', 'c') 3
('b', 'd') 0
('c', 'a') 0
('c', 'b') 0
('c', 'd') 0
('d', 'a') 1
('d', 'b') 0
('d', 'c') 0
('a', 'b', 'c') 2
('a', 'b', 'd') 0
('a', 'c', 'b') 0
('a', 'c', 'd') 0
('a', 'd', 'b') 0
('a', 'd', 'c') 0
('b', 'a', 'c') 0
('b', 'a', 'd') 0
('b', 'c', 'a') 0
('b', 'c', 'd') 0
('b', 'd', 'a') 0
('b', 'd', 'c') 0
('c', 'a', 'b') 0
('c', 'a', 'd') 0
('c', 'b', 'a') 0
('c', 'b', 'd') 0
('c', 'd', 'a') 0
('c', 'd', 'b') 0
('d', 'a', 'b') 0
('d', 'a', 'c') 0
('d', 'b', 'a') 0
('d', 'b', 'c') 0
('d', 'c', 'a') 0
('d', 'c', 'b') 0
('a', 'b', 'c', 'd') 0
('a', 'b', 'd', 'c') 0
('a', 'c', 'b', 'd') 0
('a', 'c', 'd', 'b') 0
('a', 'd', 'b', 'c') 0
('a', 'd', 'c', 'b') 0
('b', 'a', 'c', 'd') 0
('b', 'a', 'd', 'c') 0
('b', 'c', 'a', 'd') 0
('b', 'c', 'd', 'a') 0
('b', 'd', 'a', 'c') 0
('b', 'd', 'c', 'a') 0
('c', 'a', 'b', 'd') 0
('c', 'a', 'd', 'b') 0
('c', 'b', 'a', 'd') 0
('c', 'b', 'd', 'a') 0
('c', 'd', 'a', 'b') 0
('c', 'd', 'b', 'a') 0
('d', 'a', 'b', 'c') 0
('d', 'a', 'c', 'b') 0
('d', 'b', 'a', 'c') 0
('d', 'b', 'c', 'a') 0
('d', 'c', 'a', 'b') 0
('d', 'c', 'b', 'a') 0
Change it to itertools.combinations
if need be. Change to for i in range(2,len(a))
if you don't want groups of 4
Hope it helped!
Upvotes: 2