Reputation: 374
I am looking for a way to compare (and, best case, also diff) two YAML files in Ruby; regardless of key order, naturally. So far all solutions I found depended on loading the files with YAML::load_file(). I cannot do that, however, because the files are dumps of Ruby objects whose class declarations I do not have, so that loading them throws undefined class/module.
I think I need to load them as string hashes and compare those, but how do I tell Ruby to ignore the type information and just include it in the comparison?
Based on comments: I'm basically interested in text-based comparison, but it must be aware of the "depth" of the data structure. For instance this is an excerpt from one of the files I have:
attributes: !ruby/hash:Example::Attributes
  !binary "b2NjaQ==": !ruby/hash:Example::Attributes
    !binary "Y29yZQ==": !ruby/hash:Example::Attributes
      !binary "aWQ=": !ruby/object:Example::Properties
        type: string
        required: false
        mutable: false
      !binary "dGl0bGU=": !ruby/object:Example::Properties
        type: string
        required: false
        mutable: false
So the comparison must be able to identify a match even if the two attributes are in reverse order.
Upvotes: 2
Views: 1484
Reputation: 79743
Psych, Ruby’s Yaml parser, provides several ways to examine Yaml data. The highest level loads the Yaml and provides a Ruby data structure. This is the API that looks at the Yaml tags and tries to load the appropriate Ruby classes, which is causing your problems. It also looks at the format of the data and converts it to various types (e.g. Dates) if it matches.
The next level will parse the Yaml and provide you with an AST containing the “raw” Yaml data. The high-level API works by first parsing to this AST and then traversing it using the visitor pattern to create Ruby data (normally a Hash or Array). Unfortunately it doesn’t provide anything in between these two levels, but it is fairly easy to write a visitor that creates a simplified data structure.
At its core Yaml data basically consists of scalars (which are basically strings), mappings (hashes) and sequences (arrays) – all of which can have a tag associated with them. The AST provided by Psych consists of these three types (and a couple of others), and we can create our own visitor that traverses it and produces a Ruby structure that consists solely of hashes, arrays and strings.
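To make the node types concrete, here is a minimal sketch that parses a small tagged document and looks at the resulting AST. The YAML snippet is a shortened, made-up variant of the excerpt in the question; the point is only that parsing succeeds even though the `Example::` classes are not defined.

```ruby
require 'psych'

# Hypothetical sample data, modelled on the excerpt in the question.
yaml = <<~YAML
  attributes: !ruby/hash:Example::Attributes
    id: !ruby/object:Example::Properties
      type: string
YAML

# Psych.parse builds the AST without instantiating any tagged classes.
doc  = Psych.parse(yaml)
root = doc.root # the top-level mapping node

# A mapping's children alternate: key, value, key, value, ...
key, value = root.children

puts root.class  # Psych::Nodes::Mapping
puts key.class   # Psych::Nodes::Scalar
puts key.value   # attributes
puts value.class # Psych::Nodes::Mapping
puts value.tag   # the tag is kept on the node, but nothing is loaded from it
```

Every node exposes its tag via `tag`, so a visitor is free to ignore it, keep it, or fold it into the output.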
This is loosely based on the Psych ToRuby visitor class, but instead of trying to convert the data to the appropriate Ruby type it only creates arrays, hashes and strings, throwing away any data in tags:
require 'psych'

class ToPlain < Psych::Visitors::Visitor
  # Scalars are just strings.
  def visit_Psych_Nodes_Scalar o
    o.value
  end

  # Sequences are arrays.
  def visit_Psych_Nodes_Sequence o
    o.children.each_with_object([]) do |child, list|
      list << accept(child)
    end
  end

  # Mappings are hashes.
  def visit_Psych_Nodes_Mapping o
    o.children.each_slice(2).each_with_object({}) do |(k,v), h|
      h[accept(k)] = accept(v)
    end
  end

  # We also need to handle documents...
  def visit_Psych_Nodes_Document o
    accept o.root
  end

  # ... and streams.
  def visit_Psych_Nodes_Stream o
    o.children.map { |c| accept c }
  end

  # Aliases aren't handled here :-(
  def visit_Psych_Nodes_Alias o
    # Not implemented!
  end
end
(Note this doesn’t handle aliases. It’s not too difficult to add support for them, have a look at what ToRuby does, in particular the register method and how it’s used.)
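As a rough idea of what alias support could look like, here is a sketch of a variant visitor that remembers every anchored node and returns the stored object when it meets an alias. The class name `ToPlainWithAliases` and its private `register` helper are made up for this example; the helper only imitates the idea behind ToRuby's method of the same name and is not a copy of it.

```ruby
require 'psych'

# Hypothetical variant of the plain visitor that also resolves aliases.
class ToPlainWithAliases < Psych::Visitors::Visitor
  def initialize
    super
    @anchors = {} # anchor name => already-built Ruby object
  end

  def visit_Psych_Nodes_Scalar(o)
    register(o, o.value)
  end

  def visit_Psych_Nodes_Sequence(o)
    # Register the array before filling it, so nested aliases can find it.
    list = register(o, [])
    o.children.each { |child| list << accept(child) }
    list
  end

  def visit_Psych_Nodes_Mapping(o)
    hash = register(o, {})
    o.children.each_slice(2) { |k, v| hash[accept(k)] = accept(v) }
    hash
  end

  def visit_Psych_Nodes_Document(o)
    accept(o.root)
  end

  def visit_Psych_Nodes_Stream(o)
    o.children.map { |c| accept(c) }
  end

  # An alias simply resolves to whatever was stored under its anchor.
  def visit_Psych_Nodes_Alias(o)
    @anchors.fetch(o.anchor) { raise ArgumentError, "unknown alias: #{o.anchor}" }
  end

  private

  # Remember the object if the node carries an anchor, then return it.
  def register(node, object)
    @anchors[node.anchor] = object if node.anchor
    object
  end
end

yaml = "base: &b\n  type: string\ncopy: *b\n"
data = ToPlainWithAliases.new.accept(Psych.parse(yaml))
puts data["copy"] == data["base"]
```

Note that an alias yields the very same Ruby object as its anchor, not a copy, which is fine for comparisons but matters if you mutate the result.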
You can make use of this like this:
# Could also use parse_stream or parse_file here
ast = Psych.parse(my_data)
data = ToPlain.new.accept(ast)
# data now consists of just arrays, hashes and strings
If you use this on your example data, the result is a hash that looks something like this:
{
  "attributes"=>{
    "b2NjaQ=="=>{
      "Y29yZQ=="=>{
        "aWQ="=>{
          "type"=>"string",
          "required"=>"false",
          "mutable"=>"false"
        },
        "dGl0bGU="=>{
          "type"=>"string",
          "required"=>"false",
          "mutable"=>"false"
        }
      }
    }
  }
}
Whilst the keys are a little unwieldy because you are using binary data (they stay Base64-encoded, since the !binary tag is ignored), you can still make comparisons like this:
occi_core_id = data["attributes"]["b2NjaQ=="]["Y29yZQ=="]["aWQ="]
occi_core_title = data["attributes"]["b2NjaQ=="]["Y29yZQ=="]["dGl0bGU="]
puts occi_core_id == occi_core_title
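For the "regardless of key order" and "diff" parts of the question: Ruby's `Hash#==` already ignores key order, so two plain structures can be compared directly. Below is a sketch of a small recursive diff on top of that. The helpers `plain` and `deep_diff` and the `:absent` marker are invented for this example (`plain` is just a compact, alias-free stand-in for the visitor so the snippet runs on its own), and arrays are compared as a whole, position by position.

```ruby
require 'psych'

# Compact stand-in for the plain visitor: tags are dropped, aliases unsupported.
def plain(node)
  case node
  when Psych::Nodes::Scalar   then node.value
  when Psych::Nodes::Sequence then node.children.map { |c| plain(c) }
  when Psych::Nodes::Mapping
    node.children.each_slice(2).each_with_object({}) do |(k, v), h|
      h[plain(k)] = plain(v)
    end
  when Psych::Nodes::Document then plain(node.root)
  end
end

# Returns a list of [path, left_value, right_value] entries.
# A key that exists on only one side is reported with :absent on the other.
def deep_diff(a, b, path = [])
  if a.is_a?(Hash) && b.is_a?(Hash)
    (a.keys | b.keys).flat_map do |key|
      left  = a.key?(key) ? a[key] : :absent
      right = b.key?(key) ? b[key] : :absent
      deep_diff(left, right, path + [key])
    end
  elsif a == b
    []
  else
    [[path, a, b]]
  end
end

left  = plain(Psych.parse("a: !ruby/object:Foo\n  x: '1'\n  y: '2'\nb: same\n"))
right = plain(Psych.parse("b: same\na: !ruby/object:Foo\n  y: '3'\n  x: '1'\n"))

puts left == right # false: the value under a/y differs
deep_diff(left, right).each do |path, l, r|
  puts "#{path.join('/')}: #{l.inspect} != #{r.inspect}"
end
```

With real files you would feed `Psych.parse_file(path)` into the converter instead of the inline strings.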
Upvotes: 2