Reputation: 1544
I have a text file structured in sections that I would like to break up as an array with string elements for each section. The content of each section will then be manipulated differently depending on the section. I'm using irb at the moment and will most likely break this out to be a separate ruby script file.
I've created both a string object and a file object from the input file ("sample" and "sample_file" respectively) to test out different methods. I'm sure a file read loop could work here, but I believe a simple match is all I need.
The file looks like this:
*** Section Header ***
randomly formatted content
multiple lines
*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)
This sections info
**** sub headers sometime occur***
I'm okay with treating this as normal headers for now.
I think sub headers may have something consistent about them.
*** Header ***
info for this section
Example output:
[*** Section Header ***\r\n\r\n randomly formatted content\r multiple lines, **** Another Header\r this sections info,*** sub header and its info, ...etc.]
which is [string of section, string of section, string of section]. Most of my attempts fail because of inconsistent opening and closing conditions or because the match needs to span multiple lines.
Here are my closest attempts that either create unwanted elements (like a string containing the closing asterisk of one header and the opening of another) or only grab a header.
This matches the headers:
sample.scan(/\*{3}.*/)
This matches headers and sections but creates elements from closing and opening asterisks. I don't fully understand lookahead and lookbehind assertions, but based on my searches I think the solution will look something like this:
sample.scan(/(?<=\*{3}).*?(?=\*{3})/m)
I'm now working on something to find lines that start with spaces and/or asterisks, but it's not there yet!
sample.scan(/^(\s+\*+|\*+).*/)
Any direction is greatly appreciated.
Upvotes: 1
Views: 630
Reputation: 160601
Ruby's Enumerable includes slice_before, which is really useful for this sort of task:
str = "*** Section Header ***
randomly formatted content
multiple lines
*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)
This sections info
**** sub headers sometime occur***
I'm okay with treating this as normal headers for now.
I think sub headers may have something consistent about them.
*** Header ***
info for this section
"
str.split("\n").slice_before(/^\s*\*{3}/).to_a
# => [["*** Section Header ***",
# "",
# "randomly formatted content",
# "multiple lines",
# ""],
# [" *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)",
# "",
# "This sections info"],
# [" **** sub headers sometime occur***",
# " I'm okay with treating this as normal headers for now.",
# " I think sub headers may have something consistent about them.",
# "",
# ""],
# ["*** Header ***", " info for this section"]]
Using slice_before allows me to use a very simple pattern to locate a landmark/target that indicates where the sub-array breaks occur. Using /^\s*\*{3}/ finds lines that start with optional whitespace followed by three '*'. Once found, a new sub-array begins.
If you want each sub-array to actually be a single string instead of an array of lines in the block, map(&:join) is your friend:
str.split("\n").slice_before(/^\s*\*{3}/).map(&:join)
# => ["*** Section Header *** randomly formatted content multiple lines",
# " *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# " **** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# " *** Header *** info for this section "]
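If the original line breaks matter, one variation (my own sketch, not part of the code above) is to iterate with String#each_line instead of split. each_line keeps every line's terminator, so join reassembles each section exactly as it appeared. NEWLINE_SAMPLE and sections_with_newlines are illustrative names, and the sample text is made up:

```ruby
# Hypothetical sample text; the names here are illustrative only.
NEWLINE_SAMPLE = "*** Section Header ***\n\nrandom content\nmultiple lines\n" \
                 " *** Another Header\nThis section's info\n"

# each_line keeps the trailing "\n" on every line, so joining the lines
# of a slice reproduces that part of the input verbatim.
def sections_with_newlines(text)
  text.each_line.slice_before(/^\s*\*{3}/).map(&:join)
end

sections = sections_with_newlines(NEWLINE_SAMPLE)
puts sections.length
puts sections.first.inspect
```

Because nothing is dropped, joining the resulting strings gives back the original text.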
And, if you want to strip leading and trailing whitespace, you can use strip in conjunction with map:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip }
# => ["*** Section Header *** randomly formatted content multiple lines",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# "*** Header *** info for this section"]
or:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.map(&:strip).join(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines ",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them. ",
# "*** Header *** info for this section "]
or:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip.squeeze(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# "*** Header *** info for this section"]
depending on what you want to do.
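Since the question mentions handling each section differently depending on the section, it may help to keep the header apart from its body. This is only a sketch under my own assumptions: SECTION_TEXT and parse_sections are hypothetical names, the text is assumed to begin with a header line, and blank body lines are assumed to be disposable.

```ruby
# Illustrative input; SECTION_TEXT and parse_sections are hypothetical names.
SECTION_TEXT = "*** Section Header ***\n\nrandom content\nmultiple lines\n" \
               " *** Another Header\nThis section's info\n"

# Split into sections, then separate each section's header line from its
# body so later code can branch on the header.
def parse_sections(text)
  text.split(/\r?\n/).slice_before(/^\s*\*{3}/).map do |header, *body|
    {
      # Drop the surrounding asterisks and whitespace from the header line.
      header: header.gsub(/\A[\s*]+|[\s*]+\z/, ""),
      # Discard blank lines and rejoin the remaining body lines.
      body: body.map(&:strip).reject(&:empty?).join("\n")
    }
  end
end

parse_sections(SECTION_TEXT).each do |section|
  puts "#{section[:header]}: #{section[:body].inspect}"
end
```

If the file can contain text before the first header, that leading chunk would end up treated as a header here, so it would need to be handled separately.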
In response to the comment that splitting by "\r" produces better output on the real file than "\n":
str.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a
Use /\r?\n/, which is a regular expression that looks for an optional carriage return followed by a newline. Windows uses a "\r\n" combination to mark the end of a line, whereas Mac OS and *nix use only "\n". By doing this you're not tying your code to being Windows-only.
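The example output in the question shows bare "\r" characters next to "\r\n", and /\r?\n/ does not split on a lone "\r". If the real file mixes line endings like that (an assumption on my part), a slightly broader pattern might be worth trying. MIXED_EOL, ANY_EOL and split_sections are illustrative names:

```ruby
# Hypothetical text mixing "\r\n", bare "\r", and "\n" line endings.
MIXED_EOL = "*** Section Header ***\r\n\r\ncontent\rmore content\n*** Header ***\rinfo"

# "\r\n" is listed first so a Windows line ending counts as one break,
# not two.
ANY_EOL = /\r\n|\r|\n/

def split_sections(text)
  text.split(ANY_EOL).slice_before(/^\s*\*{3}/).to_a
end

p split_sections(MIXED_EOL)
# => [["*** Section Header ***", "", "content", "more content"],
#     ["*** Header ***", "info"]]
```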
I don't know if slice_before was developed for this particular use, but I've used it for tearing apart text files and breaking them down into paragraphs, and for splitting networking device configurations into chunks, which made parsing a lot easier in either case.
Upvotes: 3
Reputation: 55002
A more readable idea might be to split before the pattern using lookahead:
str.split /(?=\n *\*{3})/
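To show what the lookahead split returns, here is a small sketch on made-up text (LOOKAHEAD_TEXT and the two method names are illustrative). The lookahead is zero-width, so the newline it looks at stays at the front of every chunk after the first:

```ruby
# Hypothetical sample text for the lookahead split.
LOOKAHEAD_TEXT = "*** Section Header ***\ncontent\n *** Another Header\ninfo\n"

# Split at every position that is followed by a newline, optional spaces,
# and three asterisks. Nothing is consumed by the lookahead.
def split_before_headers(text)
  text.split(/(?=\n *\*{3})/)
end

# Same split, with the leftover leading newline and spaces stripped.
def split_before_headers_stripped(text)
  split_before_headers(text).map(&:strip)
end

p split_before_headers(LOOKAHEAD_TEXT)
# => ["*** Section Header ***\ncontent", "\n *** Another Header\ninfo\n"]
p split_before_headers_stripped(LOOKAHEAD_TEXT)
# => ["*** Section Header ***\ncontent", "*** Another Header\ninfo"]
```

Note that this pattern only allows spaces before the asterisks and assumes "\n" line endings.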
Upvotes: 0
Reputation: 47284
There are many ways to accomplish what you're looking to do. If you want to use a regex, a pattern such as this might work (depending on the exact text, you might need to tweak it a bit):
(.*[*].*.+[^*]*)
Example:
http://regex101.com/r/aU0xU1/2
Code:
About the pattern (.*[*].*.+[^*]*):
.* matches any character except newline, zero or more times (greedy)
[*] matches a literal asterisk *
.* matches any character except newline, zero or more times (greedy)
.+ matches any character except newline, one or more times (greedy)
[^*]* matches any character except an asterisk, including newlines, zero or more times (greedy)
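As a rough illustration of how the pattern could be used from Ruby (my own sketch on made-up text, not the code from the demo link), String#scan can collect the matches. SCAN_TEXT, SECTION_PATTERN and scan_sections are hypothetical names:

```ruby
# Hypothetical sample text; real files may need a tweaked pattern.
SCAN_TEXT = "*** A ***\nbody a\n*** B\nbody b\n"

# The pattern from the answer, unchanged.
SECTION_PATTERN = /(.*[*].*.+[^*]*)/

def scan_sections(text)
  # With a capture group, scan returns one array per match, so flatten
  # the result into a plain array of strings.
  text.scan(SECTION_PATTERN).flatten
end

p scan_sections(SCAN_TEXT)
# => ["*** A ***\nbody a\n", "*** B\nbody b\n"]
```

One thing to watch for: because [^*]* stops at the first asterisk, a section whose body contains an asterisk gets cut short at that point.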
Upvotes: 1
Reputation: 67988
^((?:[ ]+|[ ]*\*)+.+)$
You can try this. Instead of \s use [ ], as \s also matches \n. See the demo and grab the captures.
http://regex101.com/r/vR4fY4/14
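To illustrate the point about \s, here is a small sketch that uses a simplified header pattern of my own rather than the full pattern above. WS_TEXT and both method names are illustrative. When scanning a whole string, \s can swallow the newline of a preceding blank line, while [ ] cannot:

```ruby
# Hypothetical text: a body line, a blank line, then a header.
WS_TEXT = "a\n\n*** H"

# \s matches "\n", so the match can begin on the blank line above the header.
def headers_with_s(text)
  text.scan(/^\s*\*+.*/)
end

# [ ] matches only a literal space, so the match stays on the header line.
def headers_with_space_class(text)
  text.scan(/^[ ]*\*+.*/)
end

p headers_with_s(WS_TEXT)           # => ["\n*** H"]
p headers_with_space_class(WS_TEXT) # => ["*** H"]
```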
Upvotes: 0