CunningFatalist

Reputation: 473

Parsing book sources with Regex in JavaScript

I am currently building a parser that is supposed to extract different sources from an absolute mess :) I've been working on it for a couple of days and it has been working just fine. However, I ran into a serious problem when trying to parse the last segments of a book source. There is no single character that reliably separates the authors from the title:

var str = 'John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';

As you can see, the string contains names separated by commas, followed by a title that itself contains commas but is not wrapped in quotes. There are also similar versions in my test data where a colon, rather than a comma, separates the last name from the title:

var str = 'John Doe, Max Mustermann, Taro Tanaka: My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';

This doesn't make it easier. What I want is to store the book's title in an object (which already contains date, publisher,...) and, afterwards, remove the title from the source string. I'd be very happy if someone could help me out :)

Here's a fiddle to play around with: http://jsfiddle.net/TheFatalist/927645vz/1/ However, I'd recommend using this tool: http://leaverou.github.io/regexplained/

Thanks a lot in advance! I will update the fiddle as soon as I figure something out.

Edit: To avoid confusion: I am looking for a regex that separates the title from the names, or for any other workaround. I hope there is some way to identify the boundary, but I cannot figure it out.

Upvotes: 0

Views: 93

Answers (2)

vks

Reputation: 67968

^(.*?)(?:,(?=[^,]*:)|\s(?=\w+:))(.*)$

Try this and grab the capture groups. Group 2 contains the title details.

Or simply split the string with this regex to get your results.

See demo.

http://regex101.com/r/kM7rT8/5
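
For anyone who wants to try the pattern outside regex101, here is a minimal sketch that applies it to both strings from the question. The helper name splitSource and the trimming of the groups are assumptions made for illustration, not part of the original answer.

```javascript
// Pattern copied verbatim from the answer above.
const pattern = /^(.*?)(?:,(?=[^,]*:)|\s(?=\w+:))(.*)$/;

// splitSource is a hypothetical helper: it returns the two capture groups,
// trimmed, or null when the pattern does not match at all.
function splitSource(source) {
  const match = pattern.exec(source);
  if (match === null) {
    return null;
  }
  return { authors: match[1].trim(), title: match[2].trim() };
}

const commaVariant = 'John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';
const colonVariant = 'John Doe, Max Mustermann, Taro Tanaka: My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';

const commaResult = splitSource(commaVariant);
const colonResult = splitSource(colonVariant);

console.log(commaResult);
console.log(colonResult);
```

With the comma variant, group 2 holds the full title. With the colon variant, the pattern as written seems to split at the last comma before the first colon, so "Taro Tanaka" ends up in the second group. It is worth checking both variants against your own data before relying on it.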

Upvotes: 1

Tom Ritsema

Reputation: 468

As @nnnnnn states, it's hard to do this in a very reliable manner, but you may get somewhere if you try to match from the end of the string:

var str = 'John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';
var str2 = 'John Doe, Max Mustermann, Taro Tanaka: My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';

// assume everything after the last colon belongs to the title, and also include the
// word characters and whitespace directly before that colon (the main title)
// everything before the title is assumed to be authors
var regex = /(.*?)((\w|\s)+:[^:]+)$/;

var str_match = regex.exec(str);
$('body').append('<br>string: "'+str+'"<br>title: '+ str_match[2]+'<br>authors: '+str_match[1]);

$('body').append('<br><br>');

var str2_match = regex.exec(str2);
$('body').append('<br>string: "'+str2+'"<br>title: '+ str2_match[2]+'<br>authors: '+str2_match[1]);
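
The same idea can be sketched without jQuery, which also makes it easier to drop into the parser from the question. This is only an illustration: the helper name parseSource, the null check, and the stripping of the trailing separator from the authors are assumptions added here. It also assumes the title contains exactly one colon, as in both sample strings.

```javascript
// Same regex as above: the title is the run of word characters and whitespace
// before the last colon, plus everything after that colon.
const fromEnd = /(.*?)((\w|\s)+:[^:]+)$/;

// parseSource is a hypothetical helper returning { authors, title } or null.
function parseSource(source) {
  const match = fromEnd.exec(source);
  if (match === null) {
    return null; // no "title: subtitle" tail found
  }
  return {
    // group 1 still ends with the separator (',' or ':'), so strip it
    authors: match[1].replace(/[\s,:]+$/, '').trim(),
    title: match[2].trim(),
  };
}

const first = parseSource('John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean');
const second = parseSource('John Doe, Max Mustermann, Taro Tanaka: My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean');

console.log(first);
console.log(second);
```

Both sample strings should yield the same authors and the same title with this sketch. A title without a colon, or a subtitle that contains another colon, would not be split correctly.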

Upvotes: 1
