Reputation: 385
I am currently attempting to parse a conversation file in JavaScript. Here is an example of such a conversation:
09/05/2016, 13:11 - Joe Bloggs: Hey Jane how're you doing? 😊 what dates are you in London again? I realise that June isn't actually that far away so might book my trains down sooner than later!
09/05/2016, 13:47 - Jane Doe: Hey! I'm in london from the 12th-16th of june! Hope you can make it down :) sorry it's a bit annoying i couldn't make it there til a sunday!
09/05/2016, 14:03 - Joe Bloggs: Right I'll speak to my boss! I've just requested 5 weeks off in November/December to visit Aus so I'll see if I can negotiate some other days! When does your uni term end in November? I'm thinking of visiting perth first then going to the east coast!
09/05/2016, 22:32 - Jane Doe: Oh that'll be awesome if you come to aus! Totally understand if it's too hard for you to request more days off in june. I finish uni early November! So should definitely be done by then if you came here
09/05/2016, 23:20 - Joe Bloggs: I could maybe get a couple of days 😊 when do you fly into London on the Sunday? Perfect! I need to speak to everyone else to make sure they're about. I can't wait to visit but it's so far away!
09/05/2016, 23:30 - Jane Doe: I fly in at like 7.30am so I'll have that whole day! I'm sure the year will fly since it's may already haha
09/05/2016, 23:34 - Joe Bloggs: Aw nice one! Even if I can get just Monday off I can get an early train on Sunday 😊
My current regular expression looks like this:
/(\d{2}\/\d{2}\/\d{4}),\s(\d(?:\d)?:\d{2})\s-\s([^:]*):\s(.*?)(?=\s*\d{2}\/|$)/gm
My approach is almost there and gives me 4 groups as expected
{
"group": 1,
"value": "09/05/2016"
},
{
"group": 2,
"value": "13:11"
},
{
"group": 3,
"value": "Joe Bloggs"
},
{
"group": 4,
"value": "Hey Jane how're you doing? 😊 what dates are you in London again? I realise that June isn't actually that far away so might book my trains down sooner than later!"
}
The problem arises when a message (group 4) contains a carriage return (see the message at line 3 in the example snippet).
I've done some research, and using [\s\S] does not solve my issue: the pattern simply stops and moves on to the next occurrence.
For the third message in the conversation, the text is cut off at the carriage return.
Any help would be appreciated!
Upvotes: 4
Views: 269
Reputation: 4069
Try
(\d{2}\/\d{2}\/\d{4}),\s(\d{1,2}:\d{2})\s-\s([^:]*):\s+(.*(?:\n+(?!\n|\d{2}\/).*)*)
(https://regex101.com/r/sA3sB8/2). It scans to the end of the line, then uses a repeated group that first checks that the next line doesn't start with \d\d/ (the start of a date, and therefore of a new message); if it doesn't, the group captures that entire line as well.
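As a rough sketch of how this could be wired up in JavaScript (the sample text and the parseChat helper are made up for illustration, and the g flag is my addition so that every message is returned):

```javascript
// Hypothetical sample: the second message spans two lines.
const chat = [
  "09/05/2016, 13:11 - Joe Bloggs: Hey Jane how're you doing?",
  "09/05/2016, 14:03 - Joe Bloggs: Right I'll speak to my boss!",
  "When does your uni term end in November?",
  "09/05/2016, 22:32 - Jane Doe: Oh that'll be awesome!",
].join("\n");

// The pattern from this answer, plus the global flag.
const messagePattern =
  /(\d{2}\/\d{2}\/\d{4}),\s(\d{1,2}:\d{2})\s-\s([^:]*):\s+(.*(?:\n+(?!\n|\d{2}\/).*)*)/g;

// parseChat is an illustrative helper name, not part of any library.
function parseChat(text) {
  return Array.from(
    text.matchAll(messagePattern),
    ([, date, time, sender, message]) => ({ date, time, sender, message })
  );
}

const messages = parseChat(chat);
console.log(messages.length); // 3
console.log(messages[1].message.split("\n").length); // 2 (the continuation line is kept)
```

Note that String.prototype.matchAll requires the g flag and needs a reasonably recent runtime; a loop over messagePattern.exec(text) would do the same job on older ones.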
You can make the negative look-ahead a little more specific if you fear that two digits followed by a forward slash could hit any edge cases. It increases the number of steps, but would make it slightly safer.
If a user actually entered a newline followed by a date in that syntax, you might have problems, as the match would stop at that point. I doubt they would also include a comma and a 24-hour time, though, so requiring those in the look-ahead could be one way to handle that scenario.
Example:
09/05/2016, 23:36 - Jane Doe: Great! Let me give you my travel details:
10/01/2016 @ 6am - Arrive at the station
10/01/2016 @ 7am - Get run over by a drunk horse carriage (the driver and the horse were both sober; the carriage stayed up a bit late to drink)
10/01/2016 @ 7:15am - Pull myself out from under the carriage and kick at its wheels vehemently.
09/05/2016, 23:40 - Joe Bloggs: Haha, sounds great.
This is just an example of how a user might add text that could break that particular revision of the regex; the corresponding fix is to add more specifics to the look-ahead.
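One way the more specific look-ahead might look, as a sketch only: here I assume that a new message always begins with "dd/mm/yyyy, h:mm - ", and the text is a shortened version of the example above. The names loose, strict and bodies are just for illustration.

```javascript
// Shortened version of the example above: the first message contains
// itinerary lines that begin with a date but are not new messages.
const chat = [
  "09/05/2016, 23:36 - Jane Doe: Great! Let me give you my travel details:",
  "10/01/2016 @ 6am - Arrive at the station",
  "10/01/2016 @ 7:15am - Pull myself out from under the carriage",
  "09/05/2016, 23:40 - Joe Bloggs: Haha, sounds great.",
].join("\n");

// Original look-ahead: any line starting with two digits and a slash ends the message.
const loose =
  /(\d{2}\/\d{2}\/\d{4}),\s(\d{1,2}:\d{2})\s-\s([^:]*):\s+(.*(?:\n+(?!\n|\d{2}\/).*)*)/g;

// Stricter look-ahead (assumption): only a full "date, time - " prefix ends the message.
const strict =
  /(\d{2}\/\d{2}\/\d{4}),\s(\d{1,2}:\d{2})\s-\s([^:]*):\s+(.*(?:\n+(?!\n|\d{2}\/\d{2}\/\d{4},\s\d{1,2}:\d{2}\s-\s).*)*)/g;

// Collect group 4 (the message body) of every match.
const bodies = (pattern) => Array.from(chat.matchAll(pattern), (m) => m[4]);

const looseBodies = bodies(loose);
const strictBodies = bodies(strict);

console.log(looseBodies[0].split("\n").length);  // 1 (cut off at the first itinerary line)
console.log(strictBodies[0].split("\n").length); // 3 (itinerary lines stay in the message)
```

The stricter version takes more steps per line, and it would still break if someone pasted a full "date, time - name:" line into a message.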
Upvotes: 3