Reputation: 2185
I have this dictionary and this data frame:
In [40]:
atemp
Out[40]:
{0: ['adc telecommunications inc'],
1: ['aflac inc'],
2: ['agco corporation'],
3: ['agl resources inc'],
4: ['invesco ltd'],
5: ['ak steel holding corporation'],
6: ['amn healthcare services inc'],
7: ['amr corporation']}
In [42]:
cemptemp
Out[42]:
   Company name                 nstandar
0  1-800-FLOWERS.COM            1800flowerscom
1  1347 PROPERTY INS HLDGS INC  1347 property ins hldgs inc
2  1ST CAPITAL BANK             1st capital bank
3  1ST CENTURY BANCSHARES INC   1st century bancshares inc
4  1ST CONSTITUTION BANCORP     1st constitution bancorp
5  1ST ENTERPRISE BANK          1st enterprise bank
6  1ST PACIFIC BANCORP          1st pacific bancorp
7  1ST SOURCE CORP              1st source corporation
With my code, I take each value in the dictionary and look for the entries in the nstandar column of the pandas data frame whose Jaccard distance to that value is at least 0.1. I want to build a new dictionary whose keys are the values of the original dictionary and whose values are the matching entries from the data frame.
I've tried this code, but it gives just one value per key, and I know I should get a list per key.
sd={ y : row['nstandar'] for k,value in atemp.iteritems() for y in value for index , row in cemptemp.iterrows() if jack(y,row['nstandar'])>=0.1}
So sd is:
{'adc telecommunications inc': '1st century bancshares inc',
'aflac inc': '1st century bancshares inc',
'agco corporation': '1st source corporation',
'agl resources inc': '1st century bancshares inc',
'ak steel holding corporation': '1st source corporation',
'amn healthcare services inc': '1st century bancshares inc',
'amr corporation': '1st source corporation'}
However, the expected output for the first key should be: 'adc telecommunications inc': ['1347 property ins hldgs inc', '1st century bancshares inc']
So, how can I fix my code to get what I want?
EDIT: The code of the jaccard distance is:
def jack(a,b):
    x=a.split()
    y=b.split()
    xy = set(x+y)
    return float(len(x)+len(y)-len(xy))/float(len(xy))
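To make the threshold concrete, here is a small sketch of what jack returns for one pair from the data above. The function counts shared words divided by distinct words, which matches a set-based Jaccard score only when neither name repeats a word. The helper name jaccard_similarity is hypothetical and only there for comparison.

```python
def jack(a, b):
    # Same arithmetic as the function above: shared words / distinct words.
    x = a.split()
    y = b.split()
    xy = set(x + y)
    return float(len(x) + len(y) - len(xy)) / float(len(xy))


def jaccard_similarity(a, b):
    # Hypothetical set-based variant: |A & B| / |A | B|.
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Only "inc" is shared, out of 6 distinct words, so the score is 1/6.
score = jack("adc telecommunications inc", "1st century bancshares inc")

# A repeated word inflates the list-based count but not the set-based one.
inflated = jack("a a", "a")
set_based = jaccard_similarity("a a", "a")
```

With these inputs the pair scores about 0.167, which is above the 0.1 cutoff used in the question.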
EDIT 2: I came up with a solution:
from collections import defaultdict
td=defaultdict(list)
for k,value in atemp.iteritems():
    for y in value:
        for index , row in cemptemp.iterrows():
            if jack(y,row['nstandar'])>=0.1:
                td[y].append(row['nstandar'])
However, if I try to write the same code as a dictionary comprehension, it doesn't work:
from collections import defaultdict
td=defaultdict(list)
td={y : td[y].append(row['nstandar']) for k,value in atemp.iteritems() for y in value for index , row in cemptemp.iterrows() if jack(y,row['nstandar'])>=0.1}
So, what's the difference between my solution and the code with the dict comprehension?
Upvotes: 0
Views: 166
Reputation: 3937
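In your first version:
sd={ y : row['nstandar'] ...... }
row['nstandar'] is a string, so the result is a {str: str} dict. Each time the same key y comes up again, its previous value is overwritten, so only the last match survives. That is why you get one value per key instead of a list.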
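A minimal sketch of that behaviour, using two made-up (key, value) pairs taken from the names in the question rather than the real data frame:

```python
# Two matches for the same key, in the order the loops would produce them.
pairs = [
    ("adc telecommunications inc", "1347 property ins hldgs inc"),
    ("adc telecommunications inc", "1st century bancshares inc"),
]

# A dict comprehension assigns key -> value once per pair,
# so a repeated key keeps only the value from its last pair.
last_only = {key: value for key, value in pairs}
```

Here last_only ends up with a single entry pointing at '1st century bancshares inc', which is the same pattern as the sd output shown in the question.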
And in your second version:
{y : td[y].append(row['nstandar']) ......}
td[y].append(...) is a list append operation, and its return value is None. So the result is equivalent to {y: None}.
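This can be checked in isolation. The sketch below uses the same idea with two made-up pairs, and stores the comprehension under a separate name (result, an illustrative choice) so the side effect on the defaultdict stays visible:

```python
from collections import defaultdict

pairs = [
    ("adc telecommunications inc", "1347 property ins hldgs inc"),
    ("adc telecommunications inc", "1st century bancshares inc"),
]

td = defaultdict(list)

# list.append mutates the list in place and returns None,
# so every value stored in the comprehension's dict is None.
result = {name: td[name].append(match) for name, match in pairs}
```

After this runs, result maps the key to None, while td holds both matches as a side effect. In the code from the question the comprehension is assigned back to td, so the filled defaultdict is thrown away and only the None values remain.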
If I understand your needs correctly, the code below should work:
from itertools import chain
{y: [row['nstandar'] for index, row in cemptemp.iterrows() if jack(y, row['nstandar'])>=0.1]
 for y in chain(*atemp.values())}
One possible difference: it also adds 'invesco ltd': [] to the resulting dict. If you really want to filter that out in a single expression, wrap my code as {k: v for k, v in XXXX.items() if len(v) > 0}.
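For anyone who wants to try this without the original data frame, here is a self-contained Python 3 sketch of the same nested comprehension plus the filtering step. As an assumption to keep it runnable without pandas, a plain list named nstandar stands in for cemptemp['nstandar']; the names matches and non_empty are illustrative.

```python
from itertools import chain


def jack(a, b):
    # Score from the question: shared words / distinct words.
    x = a.split()
    y = b.split()
    xy = set(x + y)
    return float(len(x) + len(y) - len(xy)) / float(len(xy))


atemp = {
    0: ["adc telecommunications inc"],
    1: ["aflac inc"],
    2: ["agco corporation"],
    3: ["agl resources inc"],
    4: ["invesco ltd"],
    5: ["ak steel holding corporation"],
    6: ["amn healthcare services inc"],
    7: ["amr corporation"],
}

# Assumption: a plain list standing in for the nstandar column.
nstandar = [
    "1800flowerscom",
    "1347 property ins hldgs inc",
    "1st capital bank",
    "1st century bancshares inc",
    "1st constitution bancorp",
    "1st enterprise bank",
    "1st pacific bancorp",
    "1st source corporation",
]

# Outer comprehension: one key per name. Inner list: all matches for it.
matches = {
    y: [name for name in nstandar if jack(y, name) >= 0.1]
    for y in chain.from_iterable(atemp.values())
}

# Optional second pass that drops names with no match at all.
non_empty = {k: v for k, v in matches.items() if v}
```

The key point is that the list is built by an inner comprehension for each key, instead of trying to grow it across iterations of a single flat comprehension.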
However, I don't recommend a dict comprehension for logic this complicated. Dict comprehensions are meant for succinct code that is easy to both write and read; with complicated logic they just hurt readability. In my opinion, your for loop solution is better.
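If the loop version is kept, one way to tidy it up is to wrap it in a small function. This is only a sketch in Python 3: find_matches and its threshold parameter are hypothetical names, and the inputs are plain lists rather than the dictionary and data frame from the question.

```python
from collections import defaultdict


def jack(a, b):
    # Score from the question: shared words / distinct words.
    x = a.split()
    y = b.split()
    xy = set(x + y)
    return float(len(x) + len(y) - len(xy)) / float(len(xy))


def find_matches(names, candidates, threshold=0.1):
    # Collect, for each name, every candidate scoring at least `threshold`.
    found = defaultdict(list)
    for name in names:
        for candidate in candidates:
            if jack(name, candidate) >= threshold:
                found[name].append(candidate)
    return dict(found)


result = find_matches(
    ["agco corporation", "invesco ltd"],
    ["1st capital bank", "1st source corporation"],
)
```

Unlike the comprehension, a name with no match never becomes a key here, so no extra filtering pass is needed.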
Upvotes: 1