Reputation: 1544

Chunk ruby string into section with regex

I have a text file structured in sections that I would like to break up as an array with string elements for each section. The content of each section will then be manipulated differently depending on the section. I'm using irb at the moment and will most likely break this out to be a separate ruby script file.

I've created both a string object and a file object from the input file ("sample" and "sample_file" respectively) to test out different methods. I'm sure a file read loop could work here, but I believe a simple match is all I need.

The file looks like this:

*** Section Header ***

randomly formatted content
multiple lines

 *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)

This sections info
       **** sub headers sometime occur***
           I'm okay with treating this as normal headers for now.
           I think sub headers may have something consistent about them.


*** Header ***
  info for this section

Example output:

[*** Section Header ***\r\n\r\n randomly formatted content\r multiple lines, **** Another Header\r this sections info,*** sub header and its info, ...etc.]

That is, [string of section, string of section, string of section]. Most of my attempts fail because the opening and closing conditions are inconsistent, or because the match needs to span multiple lines.

Here are my closest attempts that either create unwanted elements (like a string containing the closing asterisk of one header and the opening of another) or only grab a header.

This matches the headers:

sample.scan(/\*{3}.*/)

This matches headers and sections, but it also creates elements from the closing asterisks of one header and the opening asterisks of the next. I don't fully understand lookahead and lookbehind assertions, but based on my searches I think the solution will look something like this:

sample.scan(/(?<=\*{3}).*?(?=\*{3})/m)

I'm now working on something to find lines that start with spaces and/or asterisks, but it's not there yet:

sample.scan(/^(\s+\*+|\*+).*/)

Any direction is greatly appreciated.

Upvotes: 1

Answers (4)

the Tin Man

Reputation: 160601

Ruby's Enumerable includes slice_before which is really useful for this sort of task:

str = "*** Section Header ***

randomly formatted content
multiple lines

 *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)

This sections info
       **** sub headers sometime occur***
           I'm okay with treating this as normal headers for now.
           I think sub headers may have something consistent about them.


*** Header ***
  info for this section
"
str.split("\n").slice_before(/^\s*\*{3}/).to_a
# => [["*** Section Header ***",
#      "",
#      "randomly formatted content",
#      "multiple lines",
#      ""],
#     [" *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)",
#      "",
#      "This sections info"],
#     ["       **** sub headers sometime occur***",
#      "           I'm okay with treating this as normal headers for now.",
#      "           I think sub headers may have something consistent about them.",
#      "",
#      ""],
#     ["*** Header ***", "  info for this section"]]

Using slice_before allows me to use a very simple pattern to locate a landmark/target that indicates where the sub-array breaks occur. Using /^\s*\*{3}/ finds lines that start with a possible string of whitespace followed by three '*'. Once found, a new sub-array begins.
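As a minimal sketch of how the landmark pattern drives the slicing, here is the same call on a tiny array of lines. The sample data is made up for illustration and is not from the question:

```ruby
# Hypothetical sample lines; only the slicing technique comes from the answer.
lines = ["*** A ***", "one", "  *** B", "two", "three"]

# A new sub-array starts at every line that matches the header pattern:
# optional leading whitespace followed by three asterisks.
chunks = lines.slice_before(/^\s*\*{3}/).to_a

p chunks
# => [["*** A ***", "one"], ["  *** B", "two", "three"]]
```

Lines that do not match the pattern are simply appended to the sub-array that is currently open.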

If you want each sub-array to actually be a single string instead of an array of lines in the block, map(&:join) is your friend:

str.split("\n").slice_before(/^\s*\*{3}/).map(&:join)
# => ["*** Section Header ***    randomly formatted content    multiple lines",
#     "     *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)    This sections info",
#     "           **** sub headers sometime occur***               I'm okay with treating this as normal headers for now.               I think sub headers may have something consistent about them.",
#     "    *** Header ***      info for this section    "]

And, if you want to strip leading and trailing whitespace you can use strip in conjunction with map:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip }
# => ["*** Section Header ***    randomly formatted content    multiple lines",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)    This sections info",
#     "**** sub headers sometime occur***               I'm okay with treating this as normal headers for now.               I think sub headers may have something consistent about them.",
#     "*** Header ***      info for this section"]

or:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.map(&:strip).join(' ') }
# => ["*** Section Header ***  randomly formatted content multiple lines ",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)  This sections info",
#     "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.  ",
#     "*** Header *** info for this section "]

or:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip.squeeze(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
#     "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
#     "*** Header *** info for this section"]

depending on what you want to do.
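If the line structure inside each section matters, as the example output in the question suggests, one option is to join each sub-array with a newline rather than an empty string. A sketch on a shortened, made-up sample string:

```ruby
# Hypothetical, shortened sample; the real file will differ.
text = "*** First ***\n\nbody line\n *** Second\nmore\n"

sections = text.split("\n")
               .slice_before(/^\s*\*{3}/)              # group lines under each header
               .map { |lines| lines.join("\n").strip } # one string per section

p sections
# => ["*** First ***\n\nbody line", "*** Second\nmore"]
```

Here strip only removes whitespace at the outer edges of each section, so blank lines between a header and its content are kept.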

Regarding the comment that splitting by "\r" produces better output on the real file than "\n":

str.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a

Use /\r?\n/, which is a regular expression that looks for optional carriage-returns followed by a new-line. Windows uses a "\r\n" combination to mark the end of a line, whereas Mac OS and *nix use only "\n". By doing this you're not tying your code to being Windows-only.
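To illustrate the point about line endings, the sketch below runs the same split over made-up samples. The third sample and the wider pattern /\r\n?|\n/ are assumptions added for the case where a file really does use bare "\r" as its line terminator, which the comment about "\r" hints at but does not confirm:

```ruby
# Made-up samples with three different line-ending conventions.
unix    = "*** A ***\nbody\n*** B ***\nmore\n"
windows = "*** A ***\r\nbody\r\n*** B ***\r\nmore\r\n"
cr_only = "*** A ***\rbody\r*** B ***\rmore\r"

# /\r?\n/ is the pattern from the answer; it covers "\n" and "\r\n".
split_sections = ->(s) { s.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a }

# Assumption: a wider pattern that also accepts a bare "\r".
split_any = ->(s) { s.split(/\r\n?|\n/).slice_before(/^\s*\*{3}/).to_a }

p split_sections.call(unix)
p split_sections.call(windows)
p split_any.call(cr_only)
# Each call prints [["*** A ***", "body"], ["*** B ***", "more"]]
```

Checking the raw bytes of the real file (for example with inspect in irb) is probably the quickest way to see which convention it uses.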

I don't know if slice_before was developed for this particular use, but I've used it for tearing apart text files and breaking them down into paragraphs, and splitting networking device configurations into chunks, which made parsing a lot easier in either case.

Upvotes: 3

pguardiario

Reputation: 55002

A more readable idea might be to split before the pattern using lookahead:

str.split(/(?=\n *\*{3})/)
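A short sketch of what this lookahead split returns, using a made-up sample string. The strip step is an addition for illustration, not part of the answer:

```ruby
# Hypothetical sample; the lookahead pattern itself is the one from the answer.
text = "*** A ***\nbody\n *** B\nmore\n*** C ***\nend\n"

# The lookahead matches the empty position just before a newline that is
# followed by optional spaces and three asterisks, so nothing is consumed.
parts = text.split(/(?=\n *\*{3})/)
p parts
# => ["*** A ***\nbody", "\n *** B\nmore", "\n*** C ***\nend\n"]

# Every part after the first keeps its leading newline; strip tidies that up.
cleaned = parts.map(&:strip)
p cleaned
# => ["*** A ***\nbody", "*** B\nmore", "*** C ***\nend"]
```

Because the pattern requires a preceding "\n", a header on the very first line is not a split point, which is fine since it already starts the first element.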

Upvotes: 0

l'L'l

Reputation: 47284

There are many ways to accomplish what you're looking to do. If you want to use a regex, a pattern such as this might work (depending on the exact text, you might need to tweak it a bit):

(.*[*].*.+[^*]*)

Example:

http://regex101.com/r/aU0xU1/2

Code:

http://ideone.com/oMsb50

About the pattern (.*[*].*.+[^*]*):

.*    matches any character (except newline)
       (Between zero and unlimited times),  [greedy]
[*]   matches the literal character * (asterisk)
.*    matches any character (except newline)
       (Between zero and unlimited times),  [greedy]
.+    matches any character (except newline)
       (Between one and unlimited times),   [greedy]
[^*]* matches anything except an asterisk, including newlines
       (Between zero and unlimited times),  [greedy]
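As a rough sketch of how this pattern could be applied in Ruby, the snippet below uses String#scan on a made-up sample. The sample text is an assumption, and the result depends heavily on where asterisks appear in the real file:

```ruby
# Hypothetical sample with two simple sections.
text = "*** A ***\nbody\n*** B ***\nmore\n"

# scan returns one sub-array per match because the pattern contains a
# capture group; flatten turns that into a plain array of strings.
sections = text.scan(/(.*[*].*.+[^*]*)/).flatten

p sections
# => ["*** A ***\nbody\n", "*** B ***\nmore\n"]
```

Since [^*]* stops at the first asterisk it meets, an asterisk in the body lines will end the current match early, so this approach seems best suited to files where asterisks only occur in headers.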


Upvotes: 1

vks

Reputation: 67988

^((?:[ ]+|[ ]*\*)+.+)$

You can try this. Instead of \s, use [ ], since \s also matches \n. See the demo and grab the captures.

http://regex101.com/r/vR4fY4/14
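A minimal sketch of grabbing the captures in Ruby, on a made-up sample. Note that the pattern accepts any line that starts with a space or an asterisk, so indented body lines are captured as well as headers:

```ruby
# Hypothetical sample: two headers, one plain body line, one indented body line.
text = "*** A ***\nbody\n *** B\n  indented\n"

# In Ruby, ^ and $ match at line boundaries by default, so each line is
# tested on its own; flatten unwraps the single capture group.
captures = text.scan(/^((?:[ ]+|[ ]*\*)+.+)$/).flatten

p captures
# => ["*** A ***", " *** B", "  indented"]
```

This returns matching lines rather than whole sections, so it would likely need to be combined with another step to collect the content under each header.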

Upvotes: 0
