Reputation: 749
I am using YAML files to let users configure a serial workflow for a Python program that I am developing:
step1:
  method1:
    param_x: 44
  method2:
    param_y: 14
    param_t: string
  method1:
    param_x: 22
step2:
  method2:
    param_z: 7
  method1:
    param_x: 44
step3:
  method3:
    param_a: string
This is then parsed in Python and stored as a dictionary. Now, I know duplicate keys in YAML and Python dictionaries are not allowed (why, btw?), but YAML seems perfect for my case given its clarity and minimalism.
I tried to follow an approach suggested in this question (Getting duplicate keys in YAML using Python). However, in my case keys are sometimes duplicated and sometimes not, and with the proposed construct_yaml_map this creates either a dict or a list, which is not what I want. Depending on the node depth, I would like to be able to send keys and values on the second level (method1, method2, ...) to a list within a Python dictionary, to avoid the duplication issue.
Upvotes: 0
Views: 1847
Reputation: 2123
"I know duplicate keys in YAML and python dictionaries are not allowed (why, btw?)"
Because that's the whole point of a dictionary. That's how the data structure is defined. It's just like asking why each index in an array can only have one value. Because that's how it's defined. (And there are many useful things about it being defined that way.)
If a dictionary could have duplicate keys, then what would it mean to look up a value by key? Would you get one arbitrarily? Would you get a list of values instead of a single value? The whole thing would be more trouble to work with.
If your data model needs to be a mapping of keys to potentially multiple values, then you probably still want a dictionary, but the values are now actually arrays of values.
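A minimal sketch of that "dictionary of lists" idea, assuming the duplicated entries are available as (key, value) pairs. The `pairs` data and the `group_values` helper are illustrative names, not something from the question:

```python
from collections import defaultdict


def group_values(pairs):
    """Group values by key, keeping every value seen for a repeated key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical pairs mirroring step1 from the question.
pairs = [
    ("method1", {"param_x": 44}),
    ("method2", {"param_y": 14, "param_t": "string"}),
    ("method1", {"param_x": 22}),
]
step1 = group_values(pairs)
print(step1["method1"])  # [{'param_x': 44}, {'param_x': 22}]
```

Note that grouping like this keeps the order of values within one key, but no longer records that method2 ran between the two method1 calls.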
If you need to know the order that things were inserted, and different values may have the same logical key, you don't actually want a dictionary. You want an array/list. Yes, ordered dictionaries exist, but they're not going to do what you want when trying to have lists of values.
The bottom line is that an explicitly serial workflow should usually be modeled with a serial data structure--an array/list instead of a dictionary/map.
Consider this alternate structure:
steps:
  - name: step1
    methods:
      - name: method1
        params:
          x: 44
      - name: method2
        params:
          y: 14
          t: string
      - name: method1
        params:
          x: 22
  - name: step2
    methods:
      - name: method2
        params:
          z: 7
      - name: method1
        params:
          x: 44
  - name: step3
    methods:
      - name: method3
        params:
          a: string
I believe this captures your intent better, and it is valid YAML without having to resort to any shenanigans.
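For illustration, here is roughly how a program could walk that list-based layout in order. To keep the sketch free of third-party packages, the parsed result for step1 is written out as a Python literal (this is what a YAML loader would be expected to return for it); `iter_calls` is a hypothetical helper name:

```python
# Parsed form of the first step of the structure above, written as a literal
# so that the sketch does not depend on any particular YAML library.
config = {
    "steps": [
        {
            "name": "step1",
            "methods": [
                {"name": "method1", "params": {"x": 44}},
                {"name": "method2", "params": {"y": 14, "t": "string"}},
                {"name": "method1", "params": {"x": 22}},
            ],
        },
    ]
}


def iter_calls(config):
    """Yield (step name, method name, params) in the order given in the file."""
    for step in config["steps"]:
        for method in step["methods"]:
            yield step["name"], method["name"], method.get("params", {})


calls = list(iter_calls(config))
for step_name, method_name, params in calls:
    print(step_name, method_name, params)
```

Because the methods are list items, the repeated method1 needs no special handling and its position relative to method2 is preserved.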
Upvotes: 1
Reputation: 76812
While parsing, ruamel.yaml has no concept of depth beyond being at the root level of a document (among other things, in order to allow root-level literal scalars to be unindented). Adding such a notion of depth is going to be difficult, since you have to deal with aliases and possibly recursive occurrences of data. I am also not sure what this would mean in general (although it is clear enough for your example).
The method creating a mapping in the default, round-trip loader of ruamel.yaml is rather long. But if you are going to jumble mapping values together, you should not expect to be able to dump them back, let alone preserve comments, aliases, etc. The following assumes you'll be using the simpler safe loader, but may still have aliases and/or merge keys.
import sys
import ruamel.yaml

yaml_str = """\
step1:
  method1:
    param_x: 44
  method2:
    param_y: 14
    param_t: string
  method1:
    param_x: 22
step2:
  method2:
    param_z: 7
  method1:
    param_x: 44
step3:
  method3:
    param_a: string
"""
from collections.abc import Hashable

from ruamel.yaml.constructor import ConstructorError
from ruamel.yaml.nodes import MappingNode

# Python 2 flag, computed from the standard library rather than ruamel.yaml.compat
PY2 = sys.version_info[0] == 2


class MyConstructor(ruamel.yaml.constructor.SafeConstructor):
    def construct_mapping(self, node, deep=False):
        if not isinstance(node, MappingNode):
            raise ConstructorError(
                None, None, 'expected a mapping node, but found %s' % node.id, node.start_mark
            )
        total_mapping = self.yaml_base_dict_type()
        if getattr(node, 'merge', None) is not None:
            todo = [(node.merge, False), (node.value, False)]
        else:
            todo = [(node.value, True)]
        for values, check in todo:
            mapping = self.yaml_base_dict_type()  # type: Dict[Any, Any]
            for key_node, value_node in values:
                # keys can be list -> deep
                key = self.construct_object(key_node, deep=True)
                # lists are not hashable, but tuples are
                if not isinstance(key, Hashable):
                    if isinstance(key, list):
                        key = tuple(key)
                if PY2:
                    try:
                        hash(key)
                    except TypeError as exc:
                        raise ConstructorError(
                            'while constructing a mapping',
                            node.start_mark,
                            'found unacceptable key (%s)' % exc,
                            key_node.start_mark,
                        )
                else:
                    if not isinstance(key, Hashable):
                        raise ConstructorError(
                            'while constructing a mapping',
                            node.start_mark,
                            'found unhashable key',
                            key_node.start_mark,
                        )
                value = self.construct_object(value_node, deep=deep)
                if key in mapping:
                    # duplicate key: collect all of its values in a list
                    if not isinstance(mapping[key], list):
                        mapping[key] = [mapping[key]]
                    mapping[key].append(value)
                else:
                    mapping[key] = value
            total_mapping.update(mapping)
        return total_mapping

yaml = ruamel.yaml.YAML(typ='safe')
yaml.Constructor = MyConstructor
data = yaml.load(yaml_str)
for k1 in data:
    # might need to guard this with a try-except for non-dictionary first-level values
    for k2 in data[k1]:
        if not isinstance(data[k1][k2], list):  # make every second level value a list
            data[k1][k2] = [data[k1][k2]]
print(data['step1'])
which gives:
{'method1': [{'param_x': 44}, {'param_x': 22}], 'method2': [{'param_y': 14, 'param_t': 'string'}]}
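The guard mentioned in the comment of the post-processing loop could look something like the following sketch. It uses an isinstance check rather than try-except, and it runs on a hand-written dictionary instead of real loader output. The `listify_second_level` name and the extra `step4` entry are hypothetical:

```python
def listify_second_level(data):
    """Wrap every second-level value in a list, in place; skip non-mapping steps."""
    for step_value in data.values():
        if not isinstance(step_value, dict):
            continue  # e.g. a step whose value is a scalar or a sequence
        for method, value in step_value.items():
            if not isinstance(value, list):
                step_value[method] = [value]
    return data


# Hypothetical data in the shape the custom constructor would produce
# before post-processing: duplicated keys are lists, single keys are not.
loaded = {
    "step1": {
        "method1": [{"param_x": 44}, {"param_x": 22}],
        "method2": {"param_y": 14, "param_t": "string"},
    },
    "step4": "not a mapping",  # hypothetical entry, only here to exercise the guard
}
result = listify_second_level(loaded)
print(result["step1"]["method2"])  # [{'param_y': 14, 'param_t': 'string'}]
```

One limitation of this approach in general: a method whose value is itself a YAML sequence cannot be told apart from a key that was duplicated, since both end up as a Python list.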
Upvotes: 1