sola

Reputation: 1576

Merge YAML files with overriding values in list elements

I would like to merge two YAML files, (A) and (B), that contain list elements into a new file (C).

I would like to override existing attribute values of the list entries in (A) if they are also defined in (B).

I would like to add new attributes to list entries if they are not defined in (A) but defined in (B).

I would also like to add list entries from (B) that are not present in (A).

YAML file A:

list:
  - id: 1
    name: "name-from-A"
  - id: 2
    name: "name-from-A"

YAML file B:

list:
  - id: 1
    name: "name-from-B"
  - id: 2
    title: "title-from-B"
  - id: 3
    name: "name-from-B"
    title: "title-from-B"

The merged YAML file (C) that I would like to produce:

list:
  - id: 1
    name: "name-from-B"
  - id: 2
    name: "name-from-A"
    title: "title-from-B"
  - id: 3
    name: "name-from-B"
    title: "title-from-B"

I need this functionality in a Bash script, but I can require Python in the environment.

Is there any standalone YAML processor (like yq) that can do this?

How would I implement something like this in a Python script?

Upvotes: 2

Views: 6261

Answers (3)

sola

Reputation: 1576

Based on the answers (thanks, everyone), I have created a solution that handles all of the merging features I currently need in a fairly generic way (I need to use it on many different types of Kubernetes descriptors).

It is based on ruamel.yaml.

It handles multi-level lists and matches list elements by a configurable ID attribute rather than by index; items of lists that have no ID attribute configured are simply appended.

It is more complex than I had hoped because it has to traverse the YAML tree.

The script and core methods:

import ruamel.yaml
from ruamel.yaml.comments import CommentedMap, CommentedSeq


#
# Merges a node from B with its pair in A
#
# If the node exists in both A and B, it will merge
# all children in sync
#
# If the node only exists in A, it will do nothing.
#
# If the node only exists in B, it adds it to A and stops.
#
# attrPath DOES NOT include attrName
#
def mergeAttribute(parentNodeA, nodeA, nodeB, attrName, attrPath):

    # If both are None, there is nothing to merge
    if (nodeA is None) and (nodeB is None):
        return

    # If NodeA is None but NodeB has value, we simply set it in A
    if (nodeA is None) and (parentNodeA is not None):
        parentNodeA[attrName] = nodeB
        return

    if attrPath == '':
        attrPath = attrName
    else:
        attrPath = attrPath + '.' + attrName

    if isinstance(nodeB, CommentedSeq):

        # The attribute is a list, we need to merge specially
        mergeList(nodeA, nodeB, attrPath)

    elif isinstance(nodeB, CommentedMap):

        # A simple object to be merged
        mergeObject(nodeA, nodeB, attrPath)

    else:
        # Primitive type, simply overwrites
        parentNodeA[attrName] = nodeB


#
# Lists object attributes and merges the attribute values if possible
#
def mergeObject(nodeA, nodeB, attrPath):

    for attrName in nodeB:

        subNodeA = None
        if attrName in nodeA:
            subNodeA = nodeA[attrName]

        subNodeB = None
        if attrName in nodeB:
            subNodeB = nodeB[attrName]

        mergeAttribute(nodeA, subNodeA, subNodeB, attrName, attrPath)


#
# Merges two lists by properly identifying each item in both lists
# (using the merge-directives).
#
# If an item of listB is identified in listA, it will be merged onto the item
# of listA
#
def mergeList(listA, listB, attrPath):

    # Iterating the list from B
    for itemInB in listB:

        itemInA = findItemInList(listA, itemInB, attrPath)

        if itemInA is None:
            listA.append(itemInB)
            continue

        # Present in both, we need to merge them
        mergeObject(itemInA, itemInB, attrPath)


#
# Finds an item in the list by using the appropriate ID field defined for that
# attribute-path.
#
# If there is no id attribute defined for the list, it returns None
#
def findItemInList(listA, itemB, attrPath):

    if attrPath not in listsWithId:
        # No id field defined for the list, only "dumb" merging is possible
        return None

    # Finding out the name of the id attribute in the list items
    idAttrName = listsWithId[attrPath]

    idB = None
    if idAttrName is not None:
        idB = itemB[idAttrName]

    # Looking for the item by its ID
    for itemA in listA:

        idA = None
        if idAttrName is not None:
            idA = itemA[idAttrName]

        if idA == idB:
            return itemA

    return None

# ------------------------------------------------------------------------------


yaml = ruamel.yaml.YAML()

# Load the merge directives
with open('merge-directives.yaml') as fp:
    mergeDirectives = yaml.load(fp)

listsWithId = mergeDirectives['lists-with-id']

# Load the yaml files
with open('a.yaml') as fp:
    dataA = yaml.load(fp)

with open('b.yaml') as fp:
    dataB = yaml.load(fp)

mergeObject(dataA, dataB, '')

# Create a new file with the merged YAML.
# (The Python 2 builtin file() does not exist in Python 3, so use open().)
with open('c.yaml', 'w') as fp:
    yaml.dump(dataA, fp)

The helper config file (merge-directives.yaml) tells the script how to identify elements in lists, including nested (multi-level) lists.

For the data structure in the original question, only the 'list: "id"' config entry is needed, but I included some other keys to demonstrate usage.

#
# Lists that contain identifiable elements.
#
# Each sub-key is a property path denoting the list element in the YAML 
# data structure.
#
# The value is the name of the attribute in the list element that
# identifies the list element so that pairing can be made.
#
lists-with-id:
    list: "id"
    list.sub-list: "id"
    a.listAttrShared: "name"
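
To see which dotted paths a given document would need in lists-with-id, the tree can be walked the same way the script builds attrPath (list items share the path of their list, with no index component). This is only a rough sketch on plain dicts and lists; list_paths and the sample data are hypothetical helpers, not part of the script above.

```python
def list_paths(node, path=''):
    """Return the dotted paths of every list found under node."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            child = key if path == '' else path + '.' + key
            if isinstance(value, list):
                found.add(child)
            found |= list_paths(value, child)
    elif isinstance(node, list):
        # Items of a list keep the path of the list itself
        for item in node:
            found |= list_paths(item, path)
    return found


# Hypothetical sample shaped like the test files
sample = {
    'a': {'listAttrShared': [{'name': 'a1'}]},
    'list': [
        {'id': 1,
         'sub-list': [{'id': 's1'}],
         'sub-list-with-no-identification': ['comment 1']},
        {'id': 2},
    ],
}
lists_with_id = {'list': 'id', 'list.sub-list': 'id', 'a.listAttrShared': 'name'}

paths = list_paths(sample)
# Lists without a directive can only be merged by appending
unidentified = sorted(paths - set(lists_with_id))
print(unidentified)
```

Any path reported as unidentified falls back to the "dumb" append behaviour in findItemInList.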

Not yet heavily tested, but here are two test files that exercise more cases than the example in the original question.

a.yaml:

a:
    attrShared: value-from-a
    listAttrShared:
        - name: a1
        - name: a2
    attrOfAOnly: value-from-a
list:
    - id: 1
      name: "name-from-A"
      sub-list:
          - id: s1
            name: "name-from-A"
            comments: "doesn't exist in B, so left untouched"
          - id: s2
            name: "name-from-A"
      sub-list-with-no-identification:
          - "comment 1"
          - "comment 2"
    - id: 2
      name: "name-from-A"

b.yaml:

a:
    attrShared: value-from-b
    listAttrShared:
        - name: b1
        - name: b2
    attrOfBOnly: value-from-b
list:
    - id: 1
      name: "name-from-B"
      sub-list:
          - id: s2
            name: "name-from-B"
            title: "title-from-B"
            comments: "overwrites name in A with name in B + adds title from B"
          - id: s3
            name: "name-from-B"
            comments: "only exists in B so added to A's list"
      sub-list-with-no-identification:
          - "comment 3"
          - "comment 4"
    - id: 2
      title: "title-from-B"
    - id: 3
      name: "name-from-B"
      title: "title-from-B"
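
To check the intended behaviour without any YAML library, here is a condensed variant of the same idea on plain dicts and lists, applied to a trimmed-down version of the test data. It is a sketch and not the exact script above: the names merge, merge_list and LISTS_WITH_ID are made up for illustration, and the directive table is hard-coded instead of loaded from merge-directives.yaml.

```python
# Assumed directive table, mirroring part of merge-directives.yaml
LISTS_WITH_ID = {'list': 'id', 'list.sub-list': 'id'}


def merge(a, b, path=''):
    """Merge mapping b into mapping a in place; values from b win."""
    for key, b_val in b.items():
        child = key if path == '' else path + '.' + key
        a_val = a.get(key)
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            merge(a_val, b_val, child)
        elif isinstance(a_val, list) and isinstance(b_val, list):
            merge_list(a_val, b_val, child)
        else:
            # Missing in a, or a primitive: take the value from b
            a[key] = b_val


def merge_list(list_a, list_b, path):
    """Pair items by the configured id attribute, else append."""
    id_key = LISTS_WITH_ID.get(path)
    for item_b in list_b:
        match = None
        if id_key is not None and isinstance(item_b, dict):
            match = next((x for x in list_a
                          if isinstance(x, dict)
                          and x.get(id_key) == item_b.get(id_key)), None)
        if match is None:
            list_a.append(item_b)
        else:
            merge(match, item_b, path)


data_a = {'list': [
    {'id': 1, 'name': 'name-from-A',
     'sub-list': [{'id': 's1', 'name': 'name-from-A'},
                  {'id': 's2', 'name': 'name-from-A'}],
     'sub-list-with-no-identification': ['comment 1']},
    {'id': 2, 'name': 'name-from-A'},
]}
data_b = {'list': [
    {'id': 1, 'name': 'name-from-B',
     'sub-list': [{'id': 's2', 'name': 'name-from-B', 'title': 'title-from-B'},
                  {'id': 's3', 'name': 'name-from-B'}],
     'sub-list-with-no-identification': ['comment 3']},
    {'id': 2, 'title': 'title-from-B'},
    {'id': 3, 'name': 'name-from-B'},
]}

merge(data_a, data_b)
print(data_a)
```

After the call, s1 is untouched, s2 carries the name and title from B, s3 and the entry with id 3 are appended, and the unidentified list simply grows by the items from B.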

Upvotes: 0

Cole Tierney

Reputation: 10304

You could merge yaml files passed on the command line:

import sys
import yaml

def merge_dict(m_list, s):
    for m in m_list:
        if m['id'] == s['id']:
            m.update(s)
            return
    m_list.append(s)

merged_list = []
for f in sys.argv[1:]:
    with open(f) as s:
        for source in yaml.safe_load(s)['list']:
            merge_dict(merged_list, source)

print(yaml.dump({'list': merged_list}), end='')

Results:

list:
- id: 1
  name: name-from-B
- id: 2
  name: name-from-A
  title: title-from-B
- id: 3
  name: name-from-B
  title: title-from-B
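
The linear scan in merge_dict is fine for short lists. For longer ones, a possible variant is to index the entries by id in a dict. The sketch below works on already-loaded lists (so it needs no YAML library); merge_by_id is a hypothetical name, and it assumes Python 3.7+ so that dicts keep insertion order.

```python
def merge_by_id(*lists, key='id'):
    """Merge lists of dicts; later lists override earlier ones per id."""
    merged = {}
    for entries in lists:
        for entry in entries:
            # First sighting creates the entry, later ones update it
            merged.setdefault(entry[key], {}).update(entry)
    return list(merged.values())


list_a = [
    {'id': 1, 'name': 'name-from-A'},
    {'id': 2, 'name': 'name-from-A'},
]
list_b = [
    {'id': 1, 'name': 'name-from-B'},
    {'id': 2, 'title': 'title-from-B'},
    {'id': 3, 'name': 'name-from-B', 'title': 'title-from-B'},
]

result = merge_by_id(list_a, list_b)
print(result)
```

Because new dicts are built for the result, the input lists are left unmodified.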

Upvotes: 1

pymym213

Reputation: 321

You can use the ruamel.yaml Python package to do it.

If you already have Python installed, run the following command in a terminal:

pip install ruamel.yaml

Python code adapted from here (tested, and works fine):

import ruamel.yaml
yaml = ruamel.yaml.YAML()

#Load the yaml files
with open('/test1.yaml') as fp:
    data = yaml.load(fp)
with open('/test2.yaml') as fp:
    data1 = yaml.load(fp)
# dict to contain merged ids
merged = dict()

# Merge the 'list' from test1.yaml into the 'list' of test2.yaml
# (on conflicting keys, the values from test1.yaml win)
for i in data1['list']:
    for j in data['list']:
        # if same 'id'
        if i['id'] == j['id']:
            i.update(j)
            merged[i['id']] = True

# add new ids if there is some
for j in data['list']:
    if not merged.get(j['id'], False):
        data1['list'].append(j)

#create a new file with merged yaml
with open('/merged.yaml', 'w') as yaml_file:
    yaml.dump(data1, yaml_file)
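
Note which file takes precedence in the code above: i.update(j) copies the keys of the test1.yaml entry onto the test2.yaml entry, so test1.yaml wins on conflicts and should be the overriding file (B in the question). A tiny illustration of the direction of dict.update, using made-up entries:

```python
# Hypothetical entries standing in for one item from each file
entry_from_test2 = {'id': 2, 'name': 'name-from-A'}
entry_from_test1 = {'id': 2, 'name': 'name-from-B', 'title': 'title-from-B'}

# update() lets the argument override the receiver
merged = dict(entry_from_test2)
merged.update(entry_from_test1)

# Swapping the roles reverses the precedence
reversed_merge = dict(entry_from_test1)
reversed_merge.update(entry_from_test2)

print(merged['name'], reversed_merge['name'])
```

If the files are passed in the opposite order, the values from the base file silently override the ones meant to replace them.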

Upvotes: 1
