Reputation: 610
I have a list of the diff stats per file for a commit (using diff --numstat in Git) that I need to parse into a tree structure as a hash so I can use it as JSON. The raw data is in a format like this:
1 1 app/assets/javascripts/foo.js.coffee
2 1 app/assets/javascripts/bar.js
16 25 app/assets/javascripts/baz.js.coffee
11 0 app/controllers/foo_controller.rb
3 2 db/schema.rb
41 1 lib/foobar.rb
I need to parse this into a nested hash format something like the following:
{ name: "app", children: [
{ name: "assets", children: [
{ name: "javascripts", children: [
{ name: "foo.js.coffee", add: 1, del: 1 },
{ name: "bar.js", add: 2, del: 1 }
{ name: "baz.js.coffee", add: 16, del: 25 }
], add: 19, del: 27 },
...
] }
] }
Where every level of the tree is represented by its name, children as a hash and the total number of additions and deletions for that tree.
Is there an efficient way to construct a hash like this in Ruby?
Upvotes: 1
Views: 865
Reputation: 3834
Define "efficient". If your problem is "performance", your solution isn't ruby.
Unless you're literally running this script on the Linux source code, I wouldn't be worrying about performance, just clarity of intent.
I took inspiration from @dimitko's solution and I minimized the code used.
https://gist.github.com/x1024/3d0f9ad61fcb4b189be3
def git_group lines, root = 'root'
if lines.count == 1 and lines[0][:name].empty? then
return {
name: root,
add: lines.map { |l| l[:add] }.reduce(0, :+),
del: lines.map { |l| l[:del] }.reduce(0, :+),
}
end
lines = lines.group_by { |line| line[:name].shift }
.map { |key, value| git_group(value, key) }
return {
name: root,
add: lines.map { |l| l[:add] }.reduce(0, :+),
del: lines.map { |l| l[:del] }.reduce(0, :+),
children: lines
}
end
def to_git_diffnum_tree(txt)
data = txt.split("\n")
.map { |line| line.split() }
.map { |line| {add: line[0].to_i, del: line[1].to_i, name: line[2].split('/')} }
.sort_by { |item| item[:name] }
git_group(data)[:children]
end
And if you are willing to compromise with your data format (i.e. return the same data but in a different structure), you can do this with even less code:
https://gist.github.com/x1024/5ecfdfe886e31f8b5ab9
def git_group lines
dirs = lines.select { |line| line[:name].count > 1 }
files = (lines - dirs).map! { |file| [file.delete(:name).shift, file] }
dirs_processed = dirs.group_by { |dir| dir[:name].shift }
.map { |key, value| [key, git_group(value)] }
data = dirs_processed.concat(files)
return {
add: data.map { |k,l| l[:add] }.reduce(0, :+),
del: data.map { |k,l| l[:del] }.reduce(0, :+),
children: Hash[data]
}
end
def to_git_diffnum_tree(txt)
data = txt.split("\n")
.map { |line| line.split() }
.map { |line| {add: line[0].to_i, del: line[1].to_i, name: line[2].split('/')} }
.sort_by { |item| item[:name] }
git_group(data)[:children]
end
Remember kids, writing C++ in Ruby is bad.
Upvotes: 0
Reputation: 2383
Full source here: https://gist.github.com/dimitko/5541709. You can download it and directly run it without any trouble (just make sure to have the awesome_print
gem; it shows you the object hierarchy in much more human-readable format).
I enriched your test input a little, to make sure the algorithm doesn't make stupid mistakes.
Given this input:
input = <<TEXT
2 1 app/assets/javascripts/bar.js
16 25 app/assets/javascripts/baz.js.coffee
1 1 app/assets/javascripts/foo.js.coffee
4 9 app/controllers/bar_controller.rb
3 2 app/controllers/baz_controller.rb
11 0 app/controllers/foo_controller.rb
3 2 db/schema.rb
41 1 lib/foobar.rb
12 7 lib/tasks/cache.rake
5 13 lib/tasks/import.rake
TEXT
And this expected result:
[{:name=>"app", :add=>37, :del=>38, :children=>[{:name=>"assets", :add=>19, :del=>27, :children=>[{:name=>"javascripts", :add=>19, :del=>27, :children=>[{:name=>"bar.js", :add=>2, :del=>1}, {:name=>"baz.js.coffee", :add=>16, :del=>25}, {:name=>"foo.js.coffee", :add=>1, :del=>1}]}]}, {:name=>"controllers", :add=>18, :del=>11, :children=>[{:name=>"bar_controller.rb", :add=>4, :del=>9}, {:name=>"baz_controller.rb", :add=>3, :del=>2}, {:name=>"foo_controller.rb", :add=>11, :del=>0}]}]}, {:add=>3, :del=>2, :name=>"db", :children=>[{:name=>"schema.rb", :add=>3, :del=>2}]}, {:add=>58, :del=>21, :name=>"lib", :children=>[{:name=>"foobar.rb", :add=>41, :del=>1}, {:name=>"tasks", :add=>17, :del=>20, :children=>[{:name=>"cache.rake", :add=>12, :del=>7}, {:name=>"import.rake", :add=>5, :del=>13}]}]}]
And this code:
def git_diffnum_parse_paths(list, depth, out)
to = 1
base = list.first[:name][depth]
while list[to] and list[to][:name][depth] == base do
to += 1
end
if list.first[:name][depth+1]
out << {name: base, add: 0, del: 0, children: []}
# Common directory found for the first N records; recurse deeper.
git_diffnum_parse_paths(list[0..to-1], depth + 1, out.last[:children])
add = del = 0
out.last[:children].each do |x| add += x[:add].to_i; del += x[:del].to_i; end
out.last[:add] = add
out.last[:del] = del
else
# It's a file, we can't go any deeper.
out << {name: list.first[:name].last, add: list.first[:add].to_i, del: list.first[:del].to_i}
end
if list[to]
# Recurse in to try find common directories for the deeper records.
git_diffnum_parse_paths(list[to..-1], depth, out)
end
nil
end
def to_git_diffnum_tree(txt)
items = []
txt.split("\n").each do |line|
m = line.match(/(\d+)\s+(\d+)\s+(.+)/).to_a[1..3]
items << {add: m[0], del: m[1], name: m[2]}
end
items.sort! { |a,b|
a[:name] <=> b[:name]
}
items.each do |item|
item[:name] = item[:name].split("/")
end
out = []
git_diffnum_parse_paths(items, 0, out)
out
end
And this code, which is using it:
require 'awesome_print'
out = to_git_diffnum_tree(input)
puts; ap out; puts
puts; puts "Expected result:"; puts expected.inspect
puts; puts "Actual result: "; puts out.inspect
puts; puts "Are expected and actual results identical: #{expected == out}"
It seems to produce what you want.
Notes:
puts
statements in the gist, in case you wanna have a rough glimpse on how does the algorithm work.git diff --numstat `git rev-list --max-parents=0 HEAD | head -n 1` HEAD
That'd give you number of additions and deletions since the initial commit (provided your Git version is >=1.7.4.2), which is a far bigger input where you can give the algorithm a lot more rigorous testing.
Hope I helped.
Upvotes: 3