Reputation: 584
I need help using regular expressions in JavaScript. I have the following string (it has no line breaks):
var str = 'DetailedLog 18.11.2015 14:41:35.299 Neutral : 0,5704 Happy : 0,6698 Sad : 0,0013 Angry : 0,0040 Surprised : 0,0129 Scared : 0,0007 Disgusted : 0,0048 Valence : 0,6650 Arousal : 0,2297 Gender : Male Age : 20 - 30 Beard : None Moustache : None Glasses : Yes Ethnicity : Caucasian Y - Head Orientation : -1,7628 X - Head Orientation : 2,5652 Z - Head Orientation : -3,0980 Landmarks : 375,4739 - 121,6879 - 383,2627 - 113,6502 - 390,8202 - 110,3507 - 396,1021 - 109,7039 - 404,9615 - 110,9594 - 443,2603 - 108,9765 - 451,9454 - 106,7192 - 457,1207 - 106,8835 - 464,1162 - 109,5496 - 470,9659 - 116,8992 - 387,4940 - 132,0171 - 406,4031 - 130,4482 - 441,6239 - 128,6356 - 460,6862 - 128,1997 - 419,0713 - 161,6479 - 425,3519 - 155,1223 - 431,9862 - 160,6411 - 406,9320 - 190,3831 - 411,4790 - 188,7656 - 423,1751 - 185,6583 - 428,5339 - 185,6882 - 433,7802 - 184,8167 - 445,6192 - 186,3515 - 450,8424 - 187,2787 - 406,0796 - 191,1880 - 411,9287 - 193,5352 - 417,9666 - 193,6567 - 424,0851 - 193,4941 - 428,6678 - 193,5652 - 433,2172 - 192,7540 - 439,3548 - 192,0136 - 445,4181 - 191,1532 - 451,6007 - 187,9486 - 404,5193 - 190,6352 - 412,8277 - 185,4609 - 421,1355 - 181,2883 - 428,3182 - 181,1826 - 435,2024 - 180,2258 - 443,9292 - 183,2533 - 453,1117 - 187,2288 - 405,9689 - 193,2750 - 410,0249 - 199,8118 - 416,0457 - 203,0374 - 423,4839 - 204,1818 - 429,9247 - 204,2175 - 436,3620 - 203,1305 - 443,4268 - 200,9355 - 448,9572 - 197,1335 - 452,0746 - 190,0314 Quality : 0,8137 Mouth : Closed Left Eye : Open Right Eye : Open Left Eyebrow : Lowered Right Eyebrow : Lowered Identity : NO IDENTIFICATION';
My goal is to construct a usable JavaScript object from this mess, with properties and their values. I am trying to use regular expressions because, as far as I know, they perform faster than parsing with a custom for loop. The code doing this needs to be fast.
For property names I tried to construct an array of strings with this code:
str.match(/(\b[A-Z].*?\b)(?=(\s(:|\d)))/g);
This outputs:
["DetailedLog", "Neutral", "Happy", "Sad", "Angry", "Surprised", "Scared",
"Disgusted", "Valence", "Arousal", "Gender", "Male Age", "Beard", "None Moustache",
"None Glasses", "Yes Ethnicity", "Caucasian Y - Head Orientation", "X - Head Orientation",
"Z - Head Orientation", "Landmarks", "Quality", "Mouth", "Closed Left Eye",
"Open Right Eye", "Open Left Eyebrow", "Lowered Right Eyebrow", "Lowered Identity"]
Here I have a problem with strings that consist of two capitalized words like "Male Age" or "Open Left Eyebrow" or "Closed Left Eye". The first word I will use for the property value so it is getting in the way...
My first question is: what is the correct regular expression to give me this output?
["DetailedLog", "Neutral", "Happy", "Sad", "Angry", "Surprised", "Scared",
"Disgusted", "Valence", "Arousal", "Gender", "Age", "Beard", "Moustache",
"Glasses", "Ethnicity", "Y - Head Orientation", "X - Head Orientation",
"Z - Head Orientation", "Landmarks", "Quality", "Mouth", "Left Eye",
"Right Eye", "Left Eyebrow", "Right Eyebrow", "Identity"]
Thank you for any help.
Upvotes: 0
Views: 93
Reputation: 42021
(?:(DetailedLog) ([^ ]+ [^ ]+)|(\b[A-Z][A-Za-z -]+?) : ((?:(?:-?[\d,]+)(?: - -?[\d,]+)*|(?:(?:[A-Z ]+\b|[A-Za-z]+)))))(?:$| )
https://regex101.com/r/lP9pG2/3
The basic idea: because we don't know where a "key" begins, we define the "value" more precisely and stop capturing where we know the value ends.
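DetailedLog will always be followed by two sets of characters separated by a space; these characters, including the space, are treated as its value.
For the other keys, the value will be one of the following (as encoded in the regex above):
- A number made of digits and commas, optionally negative, optionally followed by more such numbers separated by " - " (for example 0,5704, 20 - 30, or the Landmarks list).
- A sequence of all upper-case characters and spaces.
- A single word made of letters only.
Note that the "sequence of all upper-case characters and spaces" alternative is there to capture the last part, Identity, specifically NO IDENTIFICATION. The value of Identity, or any other value that might contain just letters and spaces, may cause issues if it is not all upper-case.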
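To see which value shapes the pattern accepts, the value part of the regex can be tried on its own. This is only a sketch: the `^` and `$` anchors, the sample strings, and the names `valuePattern`, `samples` and `accepted` are additions for illustration and are not part of the original regex.

```javascript
// Value sub-pattern copied from the regex above, with ^ and $ added (an assumption
// made for this demo) so that each sample has to match as a whole.
var valuePattern = /^(?:(?:-?[\d,]+)(?: - -?[\d,]+)*|(?:[A-Z ]+\b|[A-Za-z]+))$/;

// Hypothetical sample values; "Not Sure" is a mixed-case, two-word value.
var samples = ["0,5704", "-1,7628", "375,4739 - 121,6879", "Male", "NO IDENTIFICATION", "Not Sure"];

// Keep only the samples that the value pattern accepts in full.
var accepted = samples.filter(function (s) {
  return valuePattern.test(s);
});

console.log(accepted);
```

The mixed-case "Not Sure" is the only sample rejected as a whole, which is the situation the caveat about Identity describes.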
var result = {};
var myregexp = /(?:(DetailedLog) ([^ ]+ [^ ]+)|(\b[A-Z][A-Za-z -]+?) : ((?:(?:-?[\d,]+)(?: - -?[\d,]+)*|(?:(?:[A-Z ]+\b|[A-Za-z]+)))))(?:$| )/g;
var match = myregexp.exec(str);
while (match != null) {
if (match[1]) {
result[match[1]] = match[2];
} else {
result[match[3]] = match[4];
}
match = myregexp.exec(str);
}
This results in result containing the following object:
{
"DetailedLog": "18.11.2015 14:41:35.299",
"Neutral": "0,5704",
"Happy": "0,6698",
"Sad": "0,0013",
"Angry": "0,0040",
"Surprised": "0,0129",
"Scared": "0,0007",
"Disgusted": "0,0048",
"Valence": "0,6650",
"Arousal": "0,2297",
"Gender": "Male",
"Age": "20 - 30",
"Beard": "None",
"Moustache": "None",
"Glasses": "Yes",
"Ethnicity": "Caucasian",
"Y - Head Orientation": "-1,7628",
"X - Head Orientation": "2,5652",
"Z - Head Orientation": "-3,0980",
"Landmarks": "375,4739 - 121,6879 - 383,2627 - 113,6502 - 390,8202 - 110,3507 - 396,1021 - 109,7039 - 404,9615 - 110,9594 - 443,2603 - 108,9765 - 451,9454 - 106,7192 - 457,1207 - 106,8835 - 464,1162 - 109,5496 - 470,9659 - 116,8992 - 387,4940 - 132,0171 - 406,4031 - 130,4482 - 441,6239 - 128,6356 - 460,6862 - 128,1997 - 419,0713 - 161,6479 - 425,3519 - 155,1223 - 431,9862 - 160,6411 - 406,9320 - 190,3831 - 411,4790 - 188,7656 - 423,1751 - 185,6583 - 428,5339 - 185,6882 - 433,7802 - 184,8167 - 445,6192 - 186,3515 - 450,8424 - 187,2787 - 406,0796 - 191,1880 - 411,9287 - 193,5352 - 417,9666 - 193,6567 - 424,0851 - 193,4941 - 428,6678 - 193,5652 - 433,2172 - 192,7540 - 439,3548 - 192,0136 - 445,4181 - 191,1532 - 451,6007 - 187,9486 - 404,5193 - 190,6352 - 412,8277 - 185,4609 - 421,1355 - 181,2883 - 428,3182 - 181,1826 - 435,2024 - 180,2258 - 443,9292 - 183,2533 - 453,1117 - 187,2288 - 405,9689 - 193,2750 - 410,0249 - 199,8118 - 416,0457 - 203,0374 - 423,4839 - 204,1818 - 429,9247 - 204,2175 - 436,3620 - 203,1305 - 443,4268 - 200,9355 - 448,9572 - 197,1335 - 452,0746 - 190,0314",
"Quality": "0,8137",
"Mouth": "Closed",
"Left Eye": "Open",
"Right Eye": "Open",
"Left Eyebrow": "Lowered",
"Right Eyebrow": "Lowered",
"Identity": "NO IDENTIFICATION"
}
Define the regular expression (myregexp) outside of any loop or repeated function call so that it only gets compiled once.
Here is a sample: http://jsperf.com/image-features-log-parsing/5
Keep in mind this sample compiles the regular expressions every time in the loops.
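A minimal sketch of hoisting the regex out of the per-row work, assuming each row is parsed by a helper function. The names `LOG_PATTERN`, `parseLogRow` and `sampleRow` are hypothetical, and the sample row is a shortened version of the string from the question.

```javascript
// Created once and reused for every row.
var LOG_PATTERN = /(?:(DetailedLog) ([^ ]+ [^ ]+)|(\b[A-Z][A-Za-z -]+?) : ((?:(?:-?[\d,]+)(?: - -?[\d,]+)*|(?:(?:[A-Z ]+\b|[A-Za-z]+)))))(?:$| )/g;

// Hypothetical helper: parses one log row into an object.
function parseLogRow(row) {
  var result = {};
  var match;
  LOG_PATTERN.lastIndex = 0; // a global regex remembers its position between exec() calls
  while ((match = LOG_PATTERN.exec(row)) !== null) {
    if (match[1]) {
      result[match[1]] = match[2]; // the DetailedLog timestamp
    } else {
      result[match[3]] = match[4]; // every other "key : value" pair
    }
  }
  return result;
}

// Shortened sample row (assumption: same layout as the full string).
var sampleRow = 'DetailedLog 18.11.2015 14:41:35.299 Neutral : 0,5704 Gender : Male Age : 20 - 30 Left Eye : Open Identity : NO IDENTIFICATION';
var parsed = parseLogRow(sampleRow);
var names = Object.keys(parsed); // the property names asked for in the question
```

Resetting `lastIndex` before each row matters because the regex is shared: without it, an interrupted parse could leave the next call starting in the middle of the string.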
Upvotes: 1
Reputation: 5525
I think there are a bit too many uses of whitespace in your string for a simple regex. Even just extracting the keywords results in the following mess, split into individual steps to make it clearer:
str.replace(/([0-9])( - )([0-9])/g,"$1-$3") // get rid of spaces between landmarks hyphen
.replace(/\: [^ ]+/g,",") // get rid of values
.replace(/(DetailedLog)([0-9.: ]+)/,"$1, ") // get rid of date
.replace(/(Identity)(.*)/,"$1") // get rid of value of "identity"
You proposed a simple parser, but that would not work if you do not know the keywords in advance. If you do know them in advance: just build that simple parser and use the keywords as delimiters. I'm pretty sure it will be even faster than any highly complicated regexp. You could use JISON to save you some headaches.
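A minimal sketch of the "keywords as delimiters" idea, assuming the keys are known, always appear in the same order, and are each followed by a space. The names `parseByKeys` and `clean` are hypothetical, and the row is a shortened sample; whether this is really faster than a regex has not been measured here.

```javascript
// Hypothetical helper: strips the leading " : " separator and surrounding spaces.
function clean(raw) {
  return raw.replace(/^[ :]+/, "").trim();
}

// Hypothetical helper: walks the row once, locating each known key with indexOf.
function parseByKeys(row, keys) {
  var out = {};
  var cursor = 0;
  var prevKey = null;
  var valueStart = 0;
  for (var i = 0; i < keys.length; i++) {
    // The trailing space keeps "Left Eye" from matching inside "Left Eyebrow".
    var at = row.indexOf(keys[i] + " ", cursor);
    if (at === -1) continue; // key missing in this row
    if (prevKey !== null) out[prevKey] = clean(row.slice(valueStart, at));
    prevKey = keys[i];
    valueStart = at + keys[i].length;
    cursor = valueStart;
  }
  // The last key's value runs to the end of the row.
  if (prevKey !== null) out[prevKey] = clean(row.slice(valueStart));
  return out;
}

var row = 'DetailedLog 18.11.2015 14:41:35.299 Gender : Male Age : 20 - 30 Left Eye : Open Left Eyebrow : Lowered Identity : NO IDENTIFICATION';
var fields = parseByKeys(row, ["DetailedLog", "Gender", "Age", "Left Eye", "Left Eyebrow", "Identity"]);
```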
Ah, I'm too late. Again.
Nevertheless, here is a very simple, unoptimized parser for benchmarking:
// That's how I made the keys-array, not actively used here
str.replace(/([0-9])( - )([0-9])/g,"$1-$3")
.replace(/\: [^ ]+/g,",")
.replace(/(DetailedLog)([0-9.: ]+)/,"$1, ")
.replace(/(Identity)(.*)/,"$1")
.replace(/([^,]+)/g,"\"$1\"" )
.replace(/\" /g,"\"")
.replace(/ \"/g,"\"");
var keys = ["DetailedLog", "Neutral" , "Happy" , "Sad" , "Angry" , "Surprised" ,
"Scared" , "Disgusted" , "Valence" , "Arousal" , "Gender" , "Age" , "Beard" ,
"Moustache" , "Glasses" , "Ethnicity" , "Y - Head Orientation" ,
"X - Head Orientation" , "Z - Head Orientation" ,
"Landmarks" , "Quality" , "Mouth" , "Left Eye" ,
"Right Eye" , "Left Eyebrow" , "Right Eyebrow" , "Identity"];
var db = {};
var value;
for(var k = 0;k < keys.length - 1;k++){
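// note: [^:]+ cannot span the colons in the DetailedLog timestamp, so that key gets no match and is skipped
var regex = new RegExp("("+keys[k] + "[ :]+)([^:]+)(" + keys[k+1] + ")");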
value = str.match(regex);
if(value){
db[keys[k]] = value[2].trim();
}
}
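// last one: its value runs to the end of the string, so it needs its own match
// (the loop's final `value` belongs to the second-to-last key)
var lastKey = keys[keys.length - 1];
value = str.match(new RegExp(lastKey + "[ :]+(.*)$"));
if(value){
db[lastKey] = value[1].trim();
}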
// take a look
JSON.stringify(db)
It should be fast enough for a couple of hundred or so rows, especially if you optimize it a bit (for example, pre-calculate the regular expressions; it's a bit silly to build them inside the loop). At the very least you now have a baseline to benchmark against, because I don't think you can do it much slower without some effort.
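One possible way to pre-calculate the regular expression is to build a single pattern from the key list once and reuse it for every row. This is a sketch, not a benchmarked replacement: `escapeRegExp`, `keyAlternation`, `pairPattern`, `parseWithKnownKeys` and the shortened `sampleRow` are hypothetical names, and the lazy match with a lookahead may well be slow on the long Landmarks value.

```javascript
var keys = ["DetailedLog", "Neutral", "Happy", "Sad", "Angry", "Surprised",
  "Scared", "Disgusted", "Valence", "Arousal", "Gender", "Age", "Beard",
  "Moustache", "Glasses", "Ethnicity", "Y - Head Orientation",
  "X - Head Orientation", "Z - Head Orientation", "Landmarks", "Quality",
  "Mouth", "Left Eye", "Right Eye", "Left Eyebrow", "Right Eyebrow", "Identity"];

// Escape regex metacharacters in a key (none of the keys above contain any, but it is cheap insurance).
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Built once: a key, its separator, then lazily everything up to the next key or the end.
// The [ :] after each key forces a full key match, so "Left Eye" is not taken for "Left Eyebrow".
var keyAlternation = keys.map(escapeRegExp).join("|");
var pairPattern = new RegExp(
  "\\b(" + keyAlternation + ")[ :]+(.*?)(?= (?:" + keyAlternation + ")[ :]|$)", "g");

function parseWithKnownKeys(row) {
  var db = {};
  var match;
  pairPattern.lastIndex = 0; // shared global regex, so reset its position per row
  while ((match = pairPattern.exec(row)) !== null) {
    db[match[1]] = match[2];
  }
  return db;
}

var sampleRow = 'DetailedLog 18.11.2015 14:41:35.299 Gender : Male Age : 20 - 30 Y - Head Orientation : -1,7628 Left Eye : Open Left Eyebrow : Lowered Identity : NO IDENTIFICATION';
var db = parseWithKnownKeys(sampleRow);
```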
Upvotes: 0
Reputation: 153
I don't have enough reputation to comment, so I'll provide a partial solution. Use the regex /(\b[A-Za-z -]+?) : (.+? )/g and then use only capture group 1. The result is shown here: https://regex101.com/r/qJ7jU7/1
The only downside is that "DetailedLog" is not captured.
In my experience, not all data is suited to being parsed by a regex in one go; sometimes you need to break it down into multiple parts.
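A sketch of the multi-part approach, assuming the row always starts with DetailedLog followed by a date and a time. The header regex, the `trim()` call and the names `sample`, `header`, `rest`, `keyPattern` and `keyNames` are additions for illustration; only the key regex comes from this answer.

```javascript
// Shortened sample row (assumption: same layout as the string in the question).
var sample = 'DetailedLog 18.11.2015 14:41:35.299 Neutral : 0,5704 Gender : Male Age : 20 - 30 Beard : None Identity : NO IDENTIFICATION';

// Part 1: peel off the header, which has no " : " separator.
var header = /^(DetailedLog) (\S+ \S+) ?/.exec(sample);
var rest = header ? sample.slice(header[0].length) : sample;

// Part 2: collect the remaining key names with the regex from this answer.
var keyPattern = /(\b[A-Za-z -]+?) : (.+? )/g;
var keyNames = header ? [header[1]] : [];
var m;
while ((m = keyPattern.exec(rest)) !== null) {
  // trim(): a key can pick up a leading space when it follows a
  // multi-part value such as "20 - 30".
  keyNames.push(m[1].trim());
}

console.log(keyNames);
```

Capture group 2 is not a reliable value here (it stops at the first space), so this sketch only answers the question about property names.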
Upvotes: 0