Craig Anderson

Reputation: 63

Selecting objects in Python dictionaries based on values

I am brand-new to Python, having decided to make the jump from Matlab. I have tried to find the answer to my question for days but without success!

The problem: I have a bunch of objects with certain attributes. Note that I am not talking about objects and attributes in the programming sense of the word. I am talking about literal astronomical objects for which I have different types of numerical data and physical attributes.

In a loop in my script, I go through each source/object in my catalogue, do some calculations, and stick the results in a huge dictionary. The form of the script is like this:

for i in range(len(ObjectCatalogue)):
    # calculate quantity1 for source i
    # calculate quantity2 for source i
    # determine attribute1 for source i
    sourceDataDict[i].update({'spectrum': quantity1})
    sourceDataDict[i].update({'peakflux': quantity2})
    sourceDataDict[i].update({'morphology': attribute1})

So once I have gone through a hundred sources or so, I can, say, access the spectrum for object no. 20 with spectrumSource20 = sourceDataDict[20]['spectrum'] etc.

What I want to do is select all objects in the dictionary based on the value of a keyword, say 'morphology'. Suppose 'morphology' can take the values 'simple' or 'complex'. Is there any way I can do this without resorting to a loop? That is, could I create a new dictionary that contains all the sources whose 'morphology' value is 'complex'?

Hard to explain, but using logical indexing that I am used to from Matlab, it would look something like

complexSourceDataDict = sourceDataDict[*]['morphology'=='complex']

(where * indicates all objects in the dictionary)

Anyway - any help would be greatly appreciated!

Upvotes: 3

Views: 6310

Answers (4)

Abhijit

Reputation: 63737

I believe you are dealing with a structure similar to the following

sourceDataDict = [
    {'spectrum':1,
    'peakflux':10,
     'morphology':'simple'
    },
    {'spectrum':2,
    'peakflux':11,
     'morphology':'comlex'
     },
    {'spectrum':3,
    'peakflux':12,
     'morphology':'simple'
     },
    {'spectrum':4,
    'peakflux':13,
     'morphology':'complex'
     }
    ]

You can do something similar using a list comprehension:

>>> [e for e in sourceDataDict if e.get('morphology',None) == 'complex']
[{'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]

Using itertools.ifilter (Python 2 only; in Python 3 the built-in filter is already lazy and takes its place), you can achieve a similar result:

>>> list(itertools.ifilter(lambda e:e.get('morphology',None) == 'complex', sourceDataDict))
[{'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]

Please note that get is used instead of indexing so that the code won't fail when the key 'morphology' does not exist. If the key is guaranteed to exist, you can rewrite the above as

>>> [e for e in sourceDataDict if e['morphology'] == 'complex']
[{'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]

>>> list(itertools.ifilter(lambda e:e['morphology'] == 'complex', sourceDataDict))
[{'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
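For readers on Python 3, where itertools.ifilter no longer exists, the same selection can be sketched with the built-in filter. The sample records below are invented for illustration, and the variable names source_data and complex_sources are assumptions, not part of the original answer.

```python
# Illustrative records; the second one deliberately lacks a 'morphology' key
# to show why .get() is the safer lookup.
source_data = [
    {'spectrum': 1, 'peakflux': 10, 'morphology': 'simple'},
    {'spectrum': 2, 'peakflux': 11},
    {'spectrum': 4, 'peakflux': 13, 'morphology': 'complex'},
]

# filter() returns a lazy iterator in Python 3, so wrap it in list()
# when an actual list is needed.
complex_sources = list(
    filter(lambda e: e.get('morphology') == 'complex', source_data)
)
print(complex_sources)
```

The record without a 'morphology' key is simply skipped, because get returns None for it instead of raising KeyError.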

Upvotes: 1

Andrew D.

Reputation: 1022

When working with a huge amount of data, you may want to store it somewhere, for instance in a database, possibly with an ORM on top, though the latter is a matter of taste. Some sort of RDBMS may be a solution.

As for raw Python, there is no built-in solution apart from functional routines like filter. Either way, you face iteration at some step (implicitly or not).

The simplest way is to keep an additional dict keyed by attribute value:

objectsBy = {'morphology': {'complex': set(), 'simple': set()}}

for item in sources:
  ...
  objMorphology = compute_morphology(item)
  # sets have no +=; use add(). item must be hashable (e.g. a source id or key)
  objectsBy['morphology'][objMorphology].add(item)
  ...
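The bookkeeping above can be sketched as a small runnable example. This is only one way to do it, under the assumption that the sources live in a dict keyed by an integer id; the names source_data and objects_by and all the values are made up for illustration. The sets hold the ids rather than the records themselves, because dicts are not hashable and cannot be stored in a set.

```python
from collections import defaultdict

# Hypothetical catalogue: source id -> computed data (values are invented).
source_data = {
    0: {'peakflux': 10, 'morphology': 'simple'},
    1: {'peakflux': 11, 'morphology': 'complex'},
    2: {'peakflux': 12, 'morphology': 'simple'},
    3: {'peakflux': 13, 'morphology': 'complex'},
}

# Index layout: attribute name -> attribute value -> set of source ids.
objects_by = defaultdict(lambda: defaultdict(set))
for source_id, data in source_data.items():
    objects_by['morphology'][data['morphology']].add(source_id)

# Look up the ids once, then pull the full records from the main dict.
complex_ids = sorted(objects_by['morphology']['complex'])
complex_sources = {i: source_data[i] for i in complex_ids}
print(complex_ids)
```

Building the index still iterates over every source once, but each later lookup by morphology is a plain dictionary access.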

Upvotes: 0

Blckknght

Reputation: 104712

There's not a direct way to index nested dictionaries out of order, like your desired syntax wants to do. However, there are a few ways to do it in Python, with varying interfaces and performance characteristics.

The best performing solution would probably be to create an additional dictionary which indexes by whatever characteristic you care about. For instance, to find the entries whose 'morphology' value is 'complex', you'd do something like this:

from collections import defaultdict

# set up morphology dict (you could do this as part of generating the morphology)
morph_dict = defaultdict(list)
for data in sourceDataDict.values():
    morph_dict[data["morphology"]].append(data)

# later, you can access a list of the values with any particular morphology
complex_morph = morph_dict["complex"]

While this is high-performance, it may be annoying to need to set up the reverse indexes for everything ahead of time. An alternative might be to use a list comprehension or generator expression to iterate over your dictionary and find the appropriate values:

# named complex_sources to avoid shadowing the built-in complex type
complex_sources = (d for d in sourceDataDict.values() if d["morphology"] == "complex")

for c in complex_sources:
    do_whatever(c)
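If the result should itself be a dictionary that keeps the original keys, which is what the question describes, a dict comprehension is one option. The sketch below assumes sourceDataDict is a dict of dicts; the name source_data_dict and the sample values are invented for illustration.

```python
# Hypothetical data: source number -> per-source results.
source_data_dict = {
    19: {'peakflux': 10, 'morphology': 'simple'},
    20: {'peakflux': 11, 'morphology': 'complex'},
    21: {'peakflux': 12, 'morphology': 'complex'},
}

# Keep only the entries whose morphology is 'complex', preserving their keys.
complex_source_data_dict = {
    key: data
    for key, data in source_data_dict.items()
    if data.get('morphology') == 'complex'
}
print(sorted(complex_source_data_dict))
```

The new dictionary holds references to the same inner dicts rather than copies, so changing a record through one dictionary is visible through the other.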

Upvotes: 1

jdi

Reputation: 92569

Without a loop, no. With a list comprehension, yes:

complex = [src for src in sourceDataDict.itervalues() if src.get('morphology') == 'complex']

If sourceDataDict happens to really be a list, you can drop the itervalues:

complex = [src for src in sourceDataDict if src.get('morphology') == 'complex']

If you think about it, evaluating a * would imply a loop operation under the hood anyways (assuming it were valid syntax). So your trick is to do the most efficient looping you can with the data structure you are using.

The only way to get more efficient would be to index all of the data objects by their "morphology" values ahead of time and keep that index up to date.
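A rough sketch of what keeping such an index up to date could look like follows. The SourceCatalogue class and its method names are hypothetical, introduced only for this example; the idea is simply that every write to 'morphology' goes through one method that also updates the index.

```python
from collections import defaultdict


class SourceCatalogue:
    """Hypothetical container that updates a morphology index on every write."""

    def __init__(self):
        self.data = {}                         # source id -> record dict
        self.by_morphology = defaultdict(set)  # morphology -> set of source ids

    def set_morphology(self, source_id, morphology):
        record = self.data.setdefault(source_id, {})
        old = record.get('morphology')
        if old is not None:
            # Remove the stale index entry before recording the new value.
            self.by_morphology[old].discard(source_id)
        record['morphology'] = morphology
        self.by_morphology[morphology].add(source_id)

    def with_morphology(self, morphology):
        # .get() avoids creating empty index entries for unknown values.
        ids = self.by_morphology.get(morphology, ())
        return {i: self.data[i] for i in ids}


catalogue = SourceCatalogue()
catalogue.set_morphology(20, 'simple')
catalogue.set_morphology(21, 'complex')
catalogue.set_morphology(20, 'complex')  # source 20 is reclassified
print(sorted(catalogue.with_morphology('complex')))
```

The trade-off is that nothing may modify a record's 'morphology' directly, otherwise the index silently goes stale.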

Upvotes: 3
