Reputation: 13195
I have a folder full of markdown files. I want to read each of them into the following Ruby object:
class File
attr_accessor :title, :description, :content
end
The markdown files usually look like this:
# This is the title
This is some description.
And even more description.
## This is an h2
Bla bla.
## This is another h2
More bla bla.
### This is even an h3
Again, more bla bla.
## Again, an h2
etc. etc.
This should result in this Ruby object:
File:
title: "This is the title"
description: "This is some description.\n\nAnd even more description."
content: "## This is an h2...etc. etc."
To assign the content of the file to the Ruby object's attributes, I could simply use a regular expression that extracts title (the first H1), description (the text between the H1 and the following H2), and content (all the rest).
But the files do not always look exactly like this: the H1, the description, or both may be missing. These exceptions can occur in combination, e.g. a file without H1 and description:
## This is an h2
Bla bla.
## This is another h2
More bla bla.
This should result in this Ruby object:
File:
title: nil
description: nil
content: "## This is an h2...More bla bla."
Or a file with H1 but no description:
# This is the title
## This is an h2
Bla bla.
This should result in this Ruby object:
File:
title: "This is the title"
description: nil
content: "## This is an h2...Bla bla."
Or a file with no H1, but a description:
This is a description.
Some more description.
## This is an h2
Bla bla.
This should result in this Ruby object:
File:
title: nil
description: "This is a description...Some more description."
content: "## This is an h2...Bla bla."
I wonder whether I can do this with a single fancy regular expression (I'm no expert in that), or whether I should split it into several processing steps. I asked a similar question here: "Markdown: Regex to find all content following an heading #2 (but stop at another heading #2)", but I couldn't get that regex to work in Ruby with the exceptions described above.
Any idea how to solve this problem is highly welcome. Thank you.
PS: I also thought about converting the markdown with a markdown parser and then using Nokogiri or something similar to parse the result. But this feels like way too much overhead for such a basically simple requirement.
Upvotes: 0
Views: 452
Reputation: 17158
Given your examples:
examples = []
examples << <<-EOS
# This is the title
This is some description.
And even more description.
## This is an h2
Bla bla.
## This is another h2
More bla bla.
### This is even an h3
Again, more bla bla.
## Again, an h2
etc. etc.
EOS
examples << <<-EOS
## This is an h2
Bla bla.
## This is another h2
More bla bla.
EOS
examples << <<-EOS
# This is the title
## This is an h2
Bla bla.
EOS
examples << <<-EOS
# This is the title
This is some description.
EOS
examples << <<-EOS
This is a description.
Some more description.
## This is an h2
Bla bla.
EOS
You can do this:
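examples.each do |text|
  # $1: title (a line starting with a single '#'), $2: description
  # (everything up to the next heading), $3: content (all the rest).
  text =~ /\A(?:(?:^#(?!#)([^\n]*))?(.*?)(?=^#|\z))?(.*)\z/m
  # Strip surrounding whitespace and turn empty captures into nil.
  title, description, content = [$1, $2, $3].map { |s|
    s.strip! if s
    s unless (s && s.empty?)
  }
  puts <<-EOS
File:
title: #{title.inspect}
description: #{description.inspect}
content: #{content.inspect}
EOS
end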
Note: The regexp doesn't care about the number of consecutive newlines.
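To check that note, here is a minimal sketch that runs the same regexp on an input with blank lines between the blocks. The helper name split_markdown and the constant HEADER_SPLIT are made up for this illustration; only the regexp itself comes from the code above.

```ruby
# The regexp from the answer, unchanged; the constant name is hypothetical.
HEADER_SPLIT = /\A(?:(?:^#(?!#)([^\n]*))?(.*?)(?=^#|\z))?(.*)\z/m

# Hypothetical helper: returns [title, description, content].
# Each part is stripped, and an empty part becomes nil.
def split_markdown(text)
  HEADER_SPLIT.match(text).captures.map do |s|
    s = s.strip if s
    s unless s.nil? || s.empty?
  end
end

# Same content as the first example, but with blank lines between the blocks.
spaced = "# This is the title\n\nThis is some description.\n\n" \
         "And even more description.\n\n\n## This is an h2\n\nBla bla.\n"

title, description, content = split_markdown(spaced)
p title        # => "This is the title"
p description  # => "This is some description.\n\nAnd even more description."
p content      # => "## This is an h2\n\nBla bla."
```

Blank lines inside a part are kept as they are; only the whitespace at the start and end of each part is stripped.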
Which gives you:
File:
title: "This is the title"
description: "This is some description.\nAnd even more description."
content: "## This is an h2\nBla bla.\n## This is another h2\nMore bla bla.\n### This is even an h3\nAgain, more bla bla.\n## Again, an h2\netc. etc."
File:
title: nil
description: nil
content: "## This is an h2\nBla bla.\n## This is another h2\nMore bla bla."
File:
title: "This is the title"
description: nil
content: "## This is an h2\nBla bla."
File:
title: "This is the title"
description: "This is some description."
content: nil
File:
title: nil
description: "This is a description.\nSome more description."
content: "## This is an h2\nBla bla."
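To get from the three captures to an object like the one in the question, something along these lines might work. It is only a sketch: MarkdownDoc, MARKDOWN_PARTS, parse_markdown and load_folder are names I made up, and the "*.md" extension is an assumption about your folder. I would avoid calling the class File, since that reopens Ruby's built-in File class instead of defining a new one.

```ruby
# Hypothetical container. It is named MarkdownDoc rather than File because
# `class File` would reopen Ruby's built-in File class.
MarkdownDoc = Struct.new(:title, :description, :content)

# The regexp from the answer, unchanged; the constant name is hypothetical.
MARKDOWN_PARTS = /\A(?:(?:^#(?!#)([^\n]*))?(.*?)(?=^#|\z))?(.*)\z/m

# Hypothetical helper: split one markdown string into a MarkdownDoc.
def parse_markdown(text)
  fields = MARKDOWN_PARTS.match(text).captures.map do |s|
    s = s.strip if s
    s unless s.nil? || s.empty?
  end
  MarkdownDoc.new(*fields)
end

# Hypothetical helper: parse every *.md file in a folder
# (the extension is an assumption).
def load_folder(dir)
  Dir.glob(File.join(dir, "*.md")).sort.map do |path|
    parse_markdown(File.read(path))
  end
end

doc = parse_markdown("# This is the title\n## This is an h2\nBla bla.\n")
p doc.title        # => "This is the title"
p doc.description  # => nil
p doc.content      # => "## This is an h2\nBla bla."
```

A Struct gives you the same accessors as attr_accessor plus a constructor, so the three parts can be passed in directly.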
Upvotes: 1