Joshua Muheim

Reputation: 13195

Ruby: Parse simple markdown files (with similar but not identical structure) and fill their contents into an object's attributes

I have a folder full of markdown files. I want to read each of them into the following Ruby object:

class File
  attr_accessor :title, :description, :content
end

The markdown files usually look like this:

# This is the title

This is some description.

And even more description.

## This is an h2

Bla bla.

## This is another h2

More bla bla.

### This is even an h3

Again, more bla bla.

## Again, an h2

etc. etc.

This should result in this Ruby object:

File:
  h1: "This is the title"
  description: "This is some description.\n\nAnd even more description."
  content: "## This is an h2...etc. etc."

To assign the content of the file to the Ruby object's attributes, I could simply use a regular expression that extracts the title (the first H1), the description (the text between the H1 and the following H2), and the content (all the rest).
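For the regular case, something like the following rough sketch seems to do the job. The pattern and the constant name `REGULAR` are just my own attempt, not a tested solution, and it deliberately ignores the exceptions described below:

```ruby
# Rough sketch for the regular case only: an H1, then a description, then H2 sections.
# REGULAR is a made-up name for this illustration.
REGULAR = /\A# (?<title>[^\n]*)\n+(?<description>.*?)\n+(?<content>^## .*)\z/m

text = "# This is the title\n\nThis is some description.\n\n## This is an h2\n\nBla bla.\n"

m = REGULAR.match(text)
title       = m[:title]          # text of the first H1
description = m[:description]    # everything between the H1 and the first H2
content     = m[:content].strip  # the first H2 and everything after it

# A file that starts directly with an H2 does not match at all.
no_h1 = REGULAR.match("## This is an h2\n\nBla bla.\n")
```

Here `no_h1` ends up as `nil`, which is exactly the problem with the less regular files.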

But the files do not always look exactly like this: the H1 may be missing, and so may the description.

These exceptions can occur in combination, e.g. a file without H1 and description:

## This is an h2

Bla bla.

## This is another h2

More bla bla.

This should result in this Ruby object:

File:
  h1: nil
  description: nil
  content: "## This is an h2...More bla bla."

Or a file with H1 but no description:

# This is the title

## This is an h2

Bla bla.

This should result in this Ruby object:

File:
  h1: "This is the title"
  description: nil
  content: "## This is an h2...Bla bla."

Or a file with no H1, but a description:

This is a description.

Some more description.

## This is an h2

Bla bla.

This should result in this Ruby object:

File:
  h1: nil
  description: "This is a description...Some more description."
  content: "## This is an h2...Bla bla."

I wonder whether I can do this with a single fancy regular expression (I'm no expert in that), or whether I should split it into several processing steps. I asked a similar question here: Markdown: Regex to find all content following an heading #2 (but stop at another heading #2), but I couldn't get the regex to work properly in Ruby with the exceptions described above.

Any idea how to solve this problem is highly welcome. Thank you.

PS: I also thought about converting the markdown with a markdown parser and then using Nokogiri or something similar to parse the result. But this feels like way too much overhead for such a basically simple requirement.

Upvotes: 0

Views: 452

Answers (1)

Fravadona

Reputation: 17158

Given your examples:

examples = []

examples << <<-EOS
# This is the title    
This is some description.    
And even more description.    
## This is an h2    
Bla bla.    
## This is another h2    
More bla bla.    
### This is even an h3    
Again, more bla bla.    
## Again, an h2    
etc. etc.    
EOS
 
examples << <<-EOS
## This is an h2    
Bla bla.    
## This is another h2    
More bla bla.
EOS
 
examples << <<-EOS
# This is the title    
## This is an h2    
Bla bla.
EOS

examples << <<-EOS
# This is the title
This is some description.
EOS

examples << <<-EOS
This is a description.
Some more description.
## This is an h2
Bla bla.
EOS

You can do this:

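examples.each do |text|
  # $1 = title (a leading H1), $2 = description (text up to the next heading),
  # $3 = content (everything from that heading to the end)
  text =~ /\A(?:(?:^#(?!#)([^\n]*))?(.*?)(?=^#|\z))?(.*)\z/m
  title, description, content = [$1, $2, $3].map { |s|
    s.strip! if s              # trim surrounding whitespace in place
    s unless (s && s.empty?)   # empty captures become nil
  }

  puts <<-EOS
File:
  title: #{title.inspect}
  description: #{description.inspect}
  content: #{content.inspect}
EOS
end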

Note: The regexp doesn't care about the number of consecutive newlines, so it also works when the blocks are separated by blank lines, as in your files.
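To illustrate the note, here is a small sketch that runs the same regexp over a text laid out like the files in the question, with blank lines between the blocks. The sample text is made up from the question's example, and `s && s.strip` is used here instead of `strip!` only to keep it short:

```ruby
# Same regexp as above, applied to a text whose blocks are separated by blank lines.
text = "# This is the title\n\nThis is some description.\n\n" \
       "And even more description.\n\n## This is an h2\n\nBla bla.\n"

text =~ /\A(?:(?:^#(?!#)([^\n]*))?(.*?)(?=^#|\z))?(.*)\z/m
title, description, content = [$1, $2, $3].map { |s| s && s.strip }

p title        # => "This is the title"
p description  # => "This is some description.\n\nAnd even more description."
p content      # => "## This is an h2\n\nBla bla."
```

The blank line inside the description is kept, while the surrounding newlines are stripped.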

Which gives you:

File:
  title: "This is the title"
  description: "This is some description.\nAnd even more description."
  content: "## This is an h2\nBla bla.\n## This is another h2\nMore bla bla.\n### This is even an h3\nAgain, more bla bla.\n## Again, an h2\netc. etc."
File:
  title: nil
  description: nil
  content: "## This is an h2\nBla bla.\n## This is another h2\nMore bla bla."
File:
  title: "This is the title"
  description: nil
  content: "## This is an h2\nBla bla."
File:
  title: "This is the title"
  description: "This is some description."
  content: nil
File:
  title: nil
  description: "This is a description.\nSome more description."
  content: "## This is an h2\nBla bla."
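If you want objects rather than printed output, the same regexp could be wrapped in a small method. This is only a sketch: `MarkdownDoc`, `PATTERN` and `parse_markdown` are names I made up for illustration. I used a Struct instead of your `class File`, because `class File` would reopen Ruby's built-in `File` class rather than define a new one.

```ruby
# Hypothetical names: MarkdownDoc, PATTERN and parse_markdown are illustrative only.
MarkdownDoc = Struct.new(:title, :description, :content)

# Same regexp as above; every part is optional, so it matches any string.
PATTERN = /\A(?:(?:^#(?!#)([^\n]*))?(.*?)(?=^#|\z))?(.*)\z/m

def parse_markdown(text)
  match = PATTERN.match(text)
  fields = match.captures.map do |s|
    s = s&.strip                  # trim whitespace, keep nil as nil
    s unless s.nil? || s.empty?   # treat empty captures as missing
  end
  MarkdownDoc.new(*fields)
end

doc = parse_markdown("# Title\nSome description.\n## Section\nBody.\n")
p doc.title        # => "Title"
p doc.description  # => "Some description."
p doc.content      # => "## Section\nBody."
```

For a whole folder, something like `Dir.glob("docs/*.md").map { |path| parse_markdown(File.read(path)) }` should work, where `docs` is just a placeholder for your folder name.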

Upvotes: 1
