Efficiently building a file system tree structure with nested hashes

Question

I have a list of the diff stats per file for a commit (using diff --numstat in Git) that I need to parse into a tree structure as a hash so I can use it as JSON. The raw data is in a format like this:

1       1       app/assets/javascripts/foo.js.coffee
2       1       app/assets/javascripts/bar.js
16      25      app/assets/javascripts/baz.js.coffee
11      0       app/controllers/foo_controller.rb
3       2       db/schema.rb
41      1       lib/foobar.rb

I need to parse this into a nested hash format something like the following:

{ name: "app", children: [
  { name: "assets", children: [
    { name: "javascripts", children: [
      { name: "foo.js.coffee", add: 1, del: 1 },
      { name: "bar.js", add: 2, del: 1 }
      { name: "baz.js.coffee", add: 16, del: 25 }
    ], add: 19, del: 27 },
    ...
  ] } 
] }

Where every level of the tree is represented by its name, children as a hash and the total number of additions and deletions for that tree.

Is there an efficient way to construct a hash like this in Ruby?

dimitarvp · Accepted Answer

Full source here: https://gist.github.com/dimitko/5541709. You can download it and directly run it without any trouble (just make sure to have the awesome_print gem; it shows you the object hierarchy in much more human-readable format).

I enriched your test input a little, to make sure the algorithm doesn't make stupid mistakes.

Given this input:

input = <



And this expected result:

[{:name=>"app", :add=>37, :del=>38, :children=>[{:name=>"assets", :add=>19, :del=>27, :children=>[{:name=>"javascripts", :add=>19, :del=>27, :children=>[{:name=>"bar.js", :add=>2, :del=>1}, {:name=>"baz.js.coffee", :add=>16, :del=>25}, {:name=>"foo.js.coffee", :add=>1, :del=>1}]}]}, {:name=>"controllers", :add=>18, :del=>11, :children=>[{:name=>"bar_controller.rb", :add=>4, :del=>9}, {:name=>"baz_controller.rb", :add=>3, :del=>2}, {:name=>"foo_controller.rb", :add=>11, :del=>0}]}]}, {:add=>3, :del=>2, :name=>"db", :children=>[{:name=>"schema.rb", :add=>3, :del=>2}]}, {:add=>58, :del=>21, :name=>"lib", :children=>[{:name=>"foobar.rb", :add=>41, :del=>1}, {:name=>"tasks", :add=>17, :del=>20, :children=>[{:name=>"cache.rake", :add=>12, :del=>7}, {:name=>"import.rake", :add=>5, :del=>13}]}]}]


And this code:

def git_diffnum_parse_paths(list, depth, out)
  to = 1
  base = list.first[:name][depth]
  while list[to] and list[to][:name][depth] == base do
    to += 1
  end

  if list.first[:name][depth+1]
    out << {name: base, add: 0, del: 0, children: []}

    # Common directory found for the first N records; recurse deeper.
    git_diffnum_parse_paths(list[0..to-1], depth + 1, out.last[:children])

    add = del = 0
    out.last[:children].each do |x| add += x[:add].to_i; del += x[:del].to_i; end
    out.last[:add] = add
    out.last[:del] = del
  else
    # It's a file, we can't go any deeper.
    out << {name: list.first[:name].last, add: list.first[:add].to_i, del:     list.first[:del].to_i}
  end

  if list[to]
    # Recurse in to try find common directories for the deeper records.
    git_diffnum_parse_paths(list[to..-1], depth, out)
  end

  nil
end

def to_git_diffnum_tree(txt)
  items = []
  txt.split("
").each do |line|
    m = line.match(/(\d+)\s+(\d+)\s+(.+)/).to_a[1..3]
    items << {add: m[0], del: m[1], name: m[2]}
  end

  items.sort! { |a,b|
    a[:name] <=> b[:name]
  }

  items.each do |item|
    item[:name] = item[:name].split("/")
  end

  out = []
  git_diffnum_parse_paths(items, 0, out)
  out
end


And this code, which is using it:

require 'awesome_print'
out = to_git_diffnum_tree(input)
puts; ap out; puts
puts; puts "Expected result:"; puts expected.inspect
puts; puts "Actual result: ";  puts out.inspect
puts; puts "Are expected and actual results identical: #{expected == out}"


It seems to produce what you want.

Notes:


I am sorting the array of parsed entries by directory/file names. This is done to avoid walking the entire list to search for a common directory; instead, the algorithm can scan the list up until the first non-match.
I am far from thinking that's the most optimal solution, but it's what I have came up with for a free hour.
I have left some [un-]commented puts statements in the gist, in case you wanna have a rough glimpse on how does the algorithm work.
In case you want to give it a more solid test, try something like this:


git diff --numstat `git rev-list --max-parents=0 HEAD | head -n 1` HEAD

That'd give you number of additions and deletions since the initial commit (provided your Git version is >=1.7.4.2), which is a far bigger input where you can give the algorithm a lot more rigorous testing.

Hope I helped.

Efficiently building a file system tree structure with nested hashes

Answers (2)

Related Questions