Reputation: 1505

Parse multiline text with pattern

Here is a small example:

02-09-17 1:01 PM - Some User (Add comments)
Hello,

How are you?

Regards,

02-09-17 3:29 PM - Another User (Add comments)
Hey,

Thanks, all is fine.

Some another text here.

02-09-17 4:30 AM - Just a User (Add comments)
some text
with
multiline

I want to parse and process these three comments. What is the best way to do this?

I tried a regex like this: http://www.rubular.com/r/k1CHJ1STTD, but I have problems with the /m flag. Without the multiline flag, the regex can't capture the "body" of the comment.

I also tried splitting by a regex:

text_above.split(/^(\d{1,2}-\d{1,2}-\d{2} \d{1,2}:\d{1,2} [AP]M - .+ \(Add comments\))/)
=> ["",
"02-09-17 1:01 PM - Some User (Add comments)",
"\n" + "Hello,\n" + "\n" + "How are you?\n" + "\n" + "Regards,\n" + "\n",
"02-09-17 3:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text     here.\n" + "\n",
"02-09-17 4:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n" + "\n",
"02-09-17 5:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n",
"02-09-17 6:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n"]

But this is not a convenient solution.

Ideally, I want to get regex captures with two or three groups, for example:

1. 02-09-17 1:01 PM
2. Some User (Add comments)
3. Hello,

How are you?

Regards,

for each comment, or an array of comments:

[['02-09-17 1:01 PM - Some User (Add comments) Hello,

How are you?

Regards,'],[...]]

Any ideas? Thanks.

Upvotes: 0

Answers (3)

Eric Duminil

Reputation: 54313

slice_before might be easier to understand than a huge scan, and it has the advantage of keeping the line that matches the pattern (a plain split removes it):

data = text.each_line.slice_before(/^\d\d\-\d\d\-\d\d/).map do |block|
  time, user = block.shift.strip.split(' - ')
  [time, user, block.join.strip]
end

p data
# [["02-09-17 1:01 PM",
#   "Some User (Add comments)",
#   "Hello,\n\nHow are you?\n\nRegards,"],
#  ["02-09-17 3:29 PM",
#   "Another User (Add comments)",
#   "Hey,\n\nThanks, all is fine.\n\nSome another text here."],
#  ["02-09-17 4:30 AM",
#   "Just a User (Add comments)",
#   "some text\nwith\nmultiline"]]

Upvotes: 1

Casimir et Hippolyte

Reputation: 89649

You can keep it simple using two splits (one for the whole string and one for each block):

text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) }

You can also use the scan method, but it's a little more tedious:

text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/)
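As a rough check of the two-split approach, the sketch below runs it against the sample from the question. The `text` heredoc and the `rows` name are assumptions for illustration, and the `strip` call is an addition so that a trailing newline does not end up in the last body.

```ruby
# Sample input copied from the question (assumed to be available as `text`).
text = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  How are you?

  Regards,

  02-09-17 3:29 PM - Another User (Add comments)
  Hey,

  Thanks, all is fine.

  Some another text here.

  02-09-17 4:30 AM - Just a User (Add comments)
  some text
  with
  multiline
TEXT

# First split: cut at every blank line that is followed by a "dd-" date prefix.
# Second split: at most 3 parts per block, so the body keeps its own newlines.
# `strip` is an addition here; it drops the heredoc's trailing newline.
rows = text.strip.split(/\n\n(?=\d\d-)/).map { |block| block.split(/ - |\n/, 3) }

rows.each do |time, user, body|
  puts "#{time} | #{user} | #{body.inspect}"
end
```

Note that a user name containing " - ", or a body paragraph that starts with two digits and a dash right after a blank line, would be split in the wrong place, so this suits input whose format is known in advance.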

Upvotes: 2

ssc-hrep3

Reputation: 16099

You can use this regular expression:

(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$)

(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) matches the first group, the date and time. The date must consist of three numbers separated by dashes, followed by the time with AM/PM.
(.*?)\r?\n((?:.|\r?\n)+?) matches the username up to the first line break (\r?\n) as the second group. After that, anything, including line breaks, is matched as the third group, the comment.
On its own, this won't work, because it would treat everything from the beginning of the comment up to the end of the file as a comment. Therefore, you need to look for the next date/time header, so that the match stops there. You could do this just by repeating the date/time format after the comment and matching non-greedily, but that would include the next date/time in the current match and therefore exclude it from the next match (which would skip every second comment). To circumvent this, you can use a positive lookahead: (?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$). This asserts that the next date/time header follows, but does not include it in the match. The last comment then ends at the end of the string ($).
You need to use the global flag /g but mustn't use the multi-line flag /m, because the comment spans multiple lines and, with /m, $ would also match at the end of every line.

Here is a live example: https://regex101.com/r/o63GQE/2
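Since the question is about Ruby, here is a minimal sketch of how this pattern might be adapted there. Ruby has no /g flag (String#scan already returns every match), and in Ruby $ matches at the end of each line regardless of flags, so the sketch ends the lookahead with \s*\z instead. The names `header`, `comment_re`, and `comments` are hypothetical, and the final `strip` on the body is an addition.

```ruby
# Sample input copied from the question.
text = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  How are you?

  Regards,

  02-09-17 3:29 PM - Another User (Add comments)
  Hey,

  Thanks, all is fine.

  Some another text here.

  02-09-17 4:30 AM - Just a User (Add comments)
  some text
  with
  multiline
TEXT

# Date/time part of a comment header, reused in the lookahead below.
header = /\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)/

# Group 1: date/time, group 2: user, group 3: body (lazy, may span lines).
# The lookahead stops the body before the next header or at the end of the
# string; \z replaces $ because Ruby's $ would match at every line end.
comment_re = /(#{header}) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n#{header} - |\s*\z)/

# scan returns one [time, user, body] triple per comment.
comments = text.scan(comment_re).map do |time, user, body|
  { time: time, user: user, body: body.strip }
end

comments.each { |c| puts "#{c[:time]} / #{c[:user]} / #{c[:body].inspect}" }
```

The hash keys are only one possible shape for the result; plain arrays from `scan` would work just as well if the three groups are all that is needed.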

Upvotes: 0
