Reputation: 3
So I'm trying to figure out a way to get the difference between two XML trees (examples below) but can't come up with anything. I need the outcome to be an array of differences, with each element in the array containing the node that was changed, how it was changed (added, deleted), and the path to the node.
Edit: Forgot to mention, the order of the XML needs to not matter. I tried using npm/dom-compare, but it doesn't quite give the desired result (with the examples below) because it doesn't expect to see the new tag (dir photos) but gives no information about it past that it found an unexpected tag.
1.
<dir name="rootDir">
<dir name="childDir">
<file name="hello.jpg"/>
</dir>
<file name="linux.txt"/>
<file name="img.png"/>
</dir>
2.
<dir name="rootDir">
<dir name="childDir">
<file name="hello.jpg"/>
<file name="interesting.brain"/>
</dir>
<dir name="photos">
<file name="me.dng"/>
</dir>
<file name="img.png"/>
</dir>
My XML sources will only ever contain and tags.
For example on the two XML docs above, compare(1, 2) should result in: (For my purposes there is no change 'changed', e.g if a files name is changed then it is a new file and the old one is treated as if it were removed not moved, and dirs are not included if their files change).
[
{node: '<file name="interesting.brain"/>', path: '/rootDir/childDir' change: 'added'},
{node: '<dir name="photos">', path: '/rootDir', change: 'added'}
{node: '<file name="linux.txt"/>', path: '/rootDir', change: 'deleted'}
]
My first thought was to first parse the XML strings into JS objects using fast-xml-parser, which results in the following objects:
1.
{ dir: [
{
name: 'rootDir',
dir: [
{
name: 'childDir',
file: [
{ name: 'hello.jpg' }
]
}
],
file: [
{ name: 'linux.txt' },
{ name: 'img.png' }
]
}
] }
2.
{ dir: [
{
name: 'rootDir',
dir: [
{
name: 'childDir',
file: [
{ name: 'hello.jpg' },
{ name: 'interesting.brain' }
]
},
{
name: 'photos',
file: [
{ name: 'me.dng' }
]
}
],
file: [
{ name: 'img.png' }
]
}
] }
However this results in extra complications because the resulting format uses arrays as well as objects, which at the very least increases the mental workload in figuring out how to diff both. It also is probably quite a bit slower since obviously you have to parse the XML string first, not to mention adding in a 3rd party library.
Looking for any advice or a pseudocode algorithm that I can use to solve this problem. Should note I'm using Typescript and targeting ES6 / Node.js.
Cheers.
Upvotes: 0
Views: 459
Reputation: 163468
There's a company called DeltaXML whose entire business model is built around solving this problem. I just mention that so you realise you are tackling a non-trivial problem.
For example, you say: Forgot to mention, the order of the XML needs to not matter.
That illustrates the fact that people want a comparison to reflect the semantics of their particular XML vocabulary, not just the XML syntax itself. Obviously there are many XML vocabularies in which changing the order of elements is a significant change.
Even with plain text or character strings, there's a whole academic literature concerning differencing. For example, read the paper given by David Birnbaum at XML Prague 2020 (https://archive.xmlprague.cz/2020/files/xmlprague-2020-proceedings.pdf#page=57) on implementation of the Needleman-Wunsch algorithm in XSLT 3.0.
Of course, you might be able to invent an algorithm that's adequate to your particular needs without researching the field completely thoroughly. But at the very least, to get a decent answer to this question you need to define your requirements a lot more precisely. One simple example does not constitute a specification.
A particular characteristic of this problem is that optimal algorithms (the ones that identify the minimum number of differences) can be very time-consuming (perhaps O(n^3)) so you may well need to make compromises between the quality of the answer and the time taken to deliver it.
Upvotes: 1
Reputation: 1141
I've created a simple solution based on your description of the problem. It may not be truly optimal, but it gets the job done (hopefully). See if this is what you need.
We'll use the xml-parse package to process XML.
TL;DR: Get the full code here.
So, to solve this problem, we will go through two steps.
STEP 1: Create maps of the XML files
Let's define a data structure called "map" (should've chosen a more descriptive name but couldn't think of one). This map will be a dictionary.
Our map consists of key-value pairs.
So the maps of the two example XML structures you provided will be like this:
Old map:
{
"/rootDir":{
"childDir":"dir",
"linux.txt":"file",
"img.png":"file"
},
"/rootDir/childDir":{
"hello.jpg":"file"
}
}
New map:
{
"/rootDir":{
"childDir":"dir",
"photos":"dir",
"img.png":"file"
},
"/rootDir/childDir":{
"hello.jpg":"file",
"interesting.brain":"file"
},
"/rootDir/photos":{
"me.dng":"file"
}
}
The recursive function to build a map from an XML structure will be like this:
// recursive function to build map
function buildMap(element, path, map) {
map[path] = {}
// const childElements = element.childNodes.filter(childNode => childNode.type === 'element');
for (const childNode of element.childNodes) {
// skip text (because the xml-parse package also returns the unnecessary texts in an XML structure, e.g. line breaks)
if (childNode.type === 'text') continue;
// process child element
// add child element's name to indicate that this path has a child with this name
// use child element's type (dir/file) as the value
map[path][childNode.attributes.name] = childNode.tagName;
// if child element is dir, process it recursively
if (childNode.tagName === 'dir') buildMap(childNode, `${path}/${childNode.attributes.name}`, map);
}
}
STEP 2: Get the differences between the two maps
Now we'll derive the changes from the maps.
Basically, what we'll do is traverse through the paths of the old map, get the set of children in each path (from both maps) and compare the two sets of children to get the changes we need.
The functions for this step are as follows:
// function to get the differences between two maps
function diffMaps(oldMap, newMap) {
const changes = [];
// traverse each path of the old map
for (const key of Object.keys(oldMap)) {
// get children in this path for both old map and new map
const oldChildren = oldMap[key];
const newChildren = newMap[key];
changes.push(...diffChildren(key, oldChildren, newChildren));
}
return changes;
}
// function to get the differences between the children of two maps
function diffChildren(path, oldChildren, newChildren) {
const changes = [];
// traverse each child of the old children
for (const key of Object.keys(oldChildren)) {
// if new children also have that child ==> no change ==> remove that child from new children and continue
if (newChildren[key]) {
// the reason for deleting is that after we have deleted all the keys that are present in old children, the remaining keys in new children will be the newly added ones.
delete newChildren[key];
continue;
}
// new children don't have that child ==> deleted ==> add to changes
const type = oldChildren[key];
changes.push({
node: type === 'dir' ? `<dir name="${key}">` : `<file name="${key}"/>`,
path: path,
change: 'deleted'
});
}
// traverse each child of the new children and add them to changes
for (const key of Object.keys(newChildren)) {
const type = newChildren[key];
changes.push({
node: type === 'dir' ? `<dir name="${key}">` : `<file name="${key}"/>`,
path: path,
change: 'added'
});
}
return changes;
}
FINALLY: Testing
Now that we have the necessary functions, just plug in our data and run :)
const oldXmlString = String.raw`
<dir name="rootDir">
<dir name="childDir">
<file name="hello.jpg"/>
</dir>
<file name="linux.txt"/>
<file name="img.png"/>
</dir>
`.trim();
const newXmlString = String.raw`
<dir name="rootDir">
<dir name="childDir">
<file name="hello.jpg"/>
<file name="interesting.brain"/>
</dir>
<dir name="photos">
<file name="me.dng"/>
</dir>
<file name="img.png"/>
</dir>
`.trim();
const oldXml = xml.parse(oldXmlString);
const newXml = xml.parse(newXmlString);
const oldRoot = oldXml[0];
const newRoot = newXml[0];
// maps with path as key and child nodes' names as value
const oldMap = {};
const newMap = {};
buildMap(oldRoot, `/${oldRoot.attributes.name}`, oldMap);
buildMap(newRoot, `/${newRoot.attributes.name}`, newMap);
const diffs = diffMaps(oldMap, newMap);
console.log(diffs);
Output:
[ { node: '<file name="linux.txt"/>',
path: '/rootDir',
change: 'deleted' },
{ node: '<dir name="photos">',
path: '/rootDir',
change: 'added' },
{ node: '<file name="interesting.brain"/>',
path: '/rootDir/childDir',
change: 'added' } ]
Upvotes: 1