paddu

Reputation: 713

Aggregating data by combining a dict of lists

I have a file with the following contents:

1234:yahoo\tgoogle\tmicrosoft\tapple\tyahoo

2345:apple\tgoogle\tgoogle

4567:yahoo\tapple\tapple

I am interested in getting the output

"Output"--> searchTerm : UserCnt, searchCnt

yahoo: 2, 3

apple: 3, 4

and so on...

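from collections import defaultdict

fname = "/tmp/sample.txt"
with open(fname) as f:
    content = f.readlines()

# Split each line into [user_id, tab-separated terms]
value = [i.strip().split(':') for i in content]
searches = {k: v.split('\t') for k, v in value}  # avoid shadowing the built-in dict

# Count how many times each term appears across all users
d = defaultdict(int)
for k, v in searches.items():
    for name in v:
        d[name] += 1
    print(k, d)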

But how do I get the user count and the search count for each search term?

Upvotes: 0

Views: 49

Answers (1)

Srini

Reputation: 1639

Yes, you can use a defaultdict for this (a regular dict works too, but I think a defaultdict is more flexible):

In [36]: a = defaultdict(defaultdict)

In [40]: l  = ["1234:yahoo\tgoogle\tmicrosoft\tapple\tyahoo", "2345:apple\tgoogle\tgoogle", "4567:yahoo\tapple\tapple"]

In [48]: for li in l:
    ...:     search_id, terms = li.split(":")[0], li.split(":")[1]
    ...:     terms = terms.split("\t")
    ...:     for term in terms:
    ...:         if "search_cnt" in a[term]:
    ...:             a[term]["search_cnt"] += 1
    ...:         else:
    ...:             a[term]["search_cnt"] = 1
    ...:     for term in set(terms):
    ...:         if "user_cnt" in a[term]:
    ...:             a[term]["user_cnt"] += 1
    ...:         else:
    ...:             a[term]["user_cnt"] = 1

In [49]: a
Out[49]:
defaultdict(collections.defaultdict,
            {'apple': defaultdict(None, {'search_cnt': 4, 'user_cnt': 3}),
             'google': defaultdict(None, {'search_cnt': 3, 'user_cnt': 2}),
             'microsoft': defaultdict(None, {'search_cnt': 1, 'user_cnt': 1}),
             'yahoo': defaultdict(None, {'search_cnt': 3, 'user_cnt': 2})})

The defaultdict above contains the counts you need.
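
For the exact layout the question asks for (searchTerm: UserCnt, searchCnt), the nested counts could be printed along these lines. This is only a sketch: `format_counts` is a made-up helper name, and the numbers are copied by hand from the output above into plain dicts.

```python
# Counts copied from the Out[49] result above, written as plain dicts.
counts = {
    "apple": {"search_cnt": 4, "user_cnt": 3},
    "google": {"search_cnt": 3, "user_cnt": 2},
    "microsoft": {"search_cnt": 1, "user_cnt": 1},
    "yahoo": {"search_cnt": 3, "user_cnt": 2},
}


def format_counts(counts):
    """Return one 'term: user_cnt, search_cnt' line per term, sorted by term.

    Hypothetical helper, not part of the original answer.
    """
    return [
        "{}: {}, {}".format(term, c["user_cnt"], c["search_cnt"])
        for term, c in sorted(counts.items())
    ]


lines = format_counts(counts)
print("\n".join(lines))
```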

I iterate over set(terms) in the second loop so that a user who searched for the same term several times is counted only once toward that term's user count :)
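
The same set trick also works with two `collections.Counter` objects instead of a nested defaultdict, which is a different technique from the session above. A minimal sketch, assuming the input looks like the sample in the question; `count_terms` is a hypothetical name, not something from the original code.

```python
from collections import Counter


def count_terms(lines):
    """Return (user_cnt, search_cnt) Counters for 'id:term<TAB>term...' lines.

    Hypothetical helper; assumes each non-empty line has one ':' separator.
    """
    user_cnt = Counter()
    search_cnt = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _user_id, _, terms = line.partition(":")
        terms = terms.split("\t")
        search_cnt.update(terms)      # every occurrence counts as a search
        user_cnt.update(set(terms))   # each term counts once per user
    return user_cnt, search_cnt


sample = [
    "1234:yahoo\tgoogle\tmicrosoft\tapple\tyahoo",
    "2345:apple\tgoogle\tgoogle",
    "4567:yahoo\tapple\tapple",
]
user_cnt, search_cnt = count_terms(sample)
for term in sorted(search_cnt):
    print("{}: {}, {}".format(term, user_cnt[term], search_cnt[term]))
```

To read from the file in the question instead, passing the open file object (for example `count_terms(open("/tmp/sample.txt"))`) should work, since the function only iterates over lines.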

Upvotes: 1
