CreamStat

Reputation: 2185

List as value per key in dictionary comprehension - Python

I have this dictionary and this data frame:

In [40]: 
atemp

Out[40]:
{0: ['adc telecommunications inc'],
 1: ['aflac inc'],
 2: ['agco corporation'],
 3: ['agl resources inc'],
 4: ['invesco ltd'],
 5: ['ak steel holding corporation'],
 6: ['amn healthcare services inc'],
 7: ['amr corporation']}

In [42]:

cemptemp


Out[42]:
     Company name                   nstandar
0    1-800-FLOWERS.COM              1800flowerscom
1    1347 PROPERTY INS HLDGS INC    1347 property ins hldgs inc
2    1ST CAPITAL BANK               1st capital bank
3    1ST CENTURY BANCSHARES INC     1st century bancshares inc
4    1ST CONSTITUTION BANCORP       1st constitution bancorp
5    1ST ENTERPRISE BANK            1st enterprise bank
6    1ST PACIFIC BANCORP            1st pacific bancorp
7    1ST SOURCE CORP                1st source corporation

With my code, I use each value of the dictionary to find the entries in the nstandar column of the pandas data frame whose Jaccard distance to that value is at least 0.1. I then want a new dictionary whose keys are the values of the original dictionary and whose values are the matching entries from the data frame.

I tried this code, but it gives only one value per key, and I know I should get a list per key.

sd={ y : row['nstandar'] for k,value in atemp.iteritems() for y in value for index , row in cemptemp.iterrows() if jack(y,row['nstandar'])>=0.1}

So sd is:

{'adc telecommunications inc': '1st century bancshares inc',
 'aflac inc': '1st century bancshares inc',
 'agco corporation': '1st source corporation',
 'agl resources inc': '1st century bancshares inc',
 'ak steel holding corporation': '1st source corporation',
 'amn healthcare services inc': '1st century bancshares inc',
 'amr corporation': '1st source corporation'}

However, the expected output for the first key should be: 'adc telecommunications inc': ['1347 property ins hldgs inc', '1st century bancshares inc']

So, how can I fix my code to get what I want?

EDIT: The code of the jaccard distance is:

def jack(a,b):
    x=a.split()
    y=b.split()
    xy = set(x+y)              
    return float(len(x)+len(y)-len(xy))/float(len(xy))

EDIT 2: I came up with a solution:

from collections import defaultdict

td=defaultdict(list)

for k,value in atemp.iteritems():
    for y in value:
        for index , row in cemptemp.iterrows():
            if jack(y,row['nstandar'])>=0.1:
                td[y].append(row['nstandar'])

However, if I try to write the same code as a dictionary comprehension, it doesn't work:

from collections import defaultdict

td=defaultdict(list)


td={y : td[y].append(row['nstandar']) for k,value in atemp.iteritems() for y in value for index , row in cemptemp.iterrows() if jack(y,row['nstandar'])>=0.1}

So, what's the difference between my solution and the code with the dict comprehension?

Upvotes: 0

Views: 166

Answers (1)

ZZY

Reputation: 3937

In the first version of your code:

sd={ y : row['nstandar'] ...... }

row['nstandar'] is a string, so the result is a {str: str} dictionary, which cannot be what you expect. Each later match for the same key simply overwrites the earlier one.
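To see why only one value per key survives, here is a minimal sketch with made-up pairs (the names pairs and last_only are hypothetical, not from the question):

```python
# Hypothetical (key, value) pairs standing in for (y, row['nstandar']).
pairs = [("a", "first match"), ("a", "second match"), ("b", "only match")]

# A dict holds one value per key, so a repeated key keeps the last value assigned.
last_only = {key: value for key, value in pairs}

print(last_only)  # {'a': 'second match', 'b': 'only match'}
```

This matches the behaviour in the question, where each key ends up with the last data frame row that passed the threshold.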

And in your second version:

`{y : td[y].append(row['nstandar']) ......}`

td[y].append(...) is a list append operation, and its return value is None, so the result is equivalent to {y: None}.

If I understand your needs correctly, the code below should work:
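A small sketch of that effect, using made-up pairs instead of the data frame (pairs and result are hypothetical names):

```python
from collections import defaultdict

td = defaultdict(list)
# Hypothetical (key, value) pairs standing in for (y, row['nstandar']).
pairs = [("a", 1), ("a", 2), ("b", 3)]

# list.append mutates the list in place and returns None,
# so every value stored by the comprehension is None.
result = {key: td[key].append(value) for key, value in pairs}

print(result)    # {'a': None, 'b': None}
print(dict(td))  # {'a': [1, 2], 'b': [3]}
```

The lists are still filled in as a side effect, but assigning the comprehension back to td throws that defaultdict away and keeps only the dict of None values.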

from itertools import chain
{y: [row['nstandar'] for index, row in cemptemp.iterrows() if jack(y, row['nstandar'])>=0.1]
 for y in chain(*atemp.values())}

One possible difference: it also adds 'invesco ltd': [] to the resulting dict. If you really want to filter it out in one line of code, wrap my code with {k: v for k, v in XXXX.items() if len(v) > 0}.

However, I don't recommend a dict comprehension for such complicated logic. Dict comprehensions are for succinct code that is easy to both write and read. For complicated logic, they just hurt readability. In my opinion, your for loop solution is better.
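Here is a self-contained sketch of the whole approach, including the filter. It assumes a plain list (nstandar) in place of the data frame column and uses only a few of the sample names, so it runs without pandas; matches and non_empty are hypothetical names.

```python
from itertools import chain

def jack(a, b):
    # Score from the question: shared tokens divided by distinct tokens.
    x = a.split()
    y = b.split()
    xy = set(x + y)
    return float(len(x) + len(y) - len(xy)) / float(len(xy))

# Assumption: a reduced copy of atemp and of the nstandar column.
atemp = {0: ['adc telecommunications inc'],
         1: ['agco corporation'],
         2: ['invesco ltd']}
nstandar = ['1347 property ins hldgs inc',
            '1st capital bank',
            '1st century bancshares inc',
            '1st source corporation']

# One list of matching names per dictionary value.
matches = {y: [name for name in nstandar if jack(y, name) >= 0.1]
           for y in chain(*atemp.values())}

# Drop the keys that matched nothing, such as 'invesco ltd'.
non_empty = {k: v for k, v in matches.items() if len(v) > 0}

print(non_empty)
```

With the real data frame, the inner list comprehension would iterate over cemptemp.iterrows() as in the code above.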

Upvotes: 1
