Reputation: 3958
as the title says I need to extract content out of long text with certain fields.
I have this text as below
Name: David Jones
Office Address: 148 Hulala Street Date: 24/11/2013
Agent No: 1234,
Address: 259 Yolo Road Start Date: 22/11/2013 Due Date: 29/11/2013
Type: Human Properties: None Ago: 29
And I have these labels for specific fields in the text
Name, Office Address, Date, Agent No, Address, Type, Properties, Age
And the result I want to get is
Name: 'David Jones',
Office Address: '148 Hulala Street',
Date: '24/11/2013',
Agent No: '1234',
Address: '259 Yolo Road',
Type: 'Human'
Properties: 'None',
Age: ''
that has completely parsed the content with each field. Important thing to note here is the original text can possibly have typo (E.g., Ago instead of Age) and extra fields that do not exist in the list of labels (E.g., Start Date and Due Date do not exist in the label list)
. So the code will ignore any un-matching text and try to find only matching result.
I tried to resolve this by going through loops for each line, check if a line contains the field, and see if the line also contains more fields.
Currently I have the following code.
structure = ['Name','Office Address','Date','Agent No','Address','Type','Properties','Age'];
obj = {};
for (i = 0; i < textLines.length; i++) {
matchingFields = [];
for (j = 0; j < structure.length; j++) {
if (textLines[i].indexOf(structure[j] + ':') !== -1) {
if (matchingFields.length === 0 && textLines[i].indexOf(structure[j] + ':') === 0) {
matchingFields.push(structure[j]);
structure.splice(structure.indexOf(structure[j--]), 1);
} else if (textLines[i].indexOf(structure[j] + ':') > textLines[i].indexOf(matchingFields[matchingFields.length-1])) {
matchingFields.push(structure[j]);
structure.splice(structure.indexOf(structure[j--]), 1);
}
}
for (j = 0; j < matchingFields.length; j++) {
if (j !== matchingFields.length-1) {
obj[matchingFields[j]] = textLines[i].slice(textLines[i].indexOf(matchingFields[j]) + matchingFields[j].length, textLines[i].indexOf(matchingFields[j+1]));
} else {
obj[matchingFields[j]] = textLines[i].slice(textLines[i].indexOf(matchingFields[j]) + matchingFields[j].length);
}
obj[matchingFields[j]] = obj[matchingFields[j]].replace(':', '');
if (obj[matchingFields[j]].indexOf(' ') === 0) {
obj[matchingFields[j]] = obj[matchingFields[j]].replace(' ', '');
}
if (obj[matchingFields[j]].charAt(obj[matchingFields[j]].length-1) === ' ') {
obj[matchingFields[j]] = obj[matchingFields[j]].slice(0, obj[matchingFields[j]].length-1);
}
}
}
In some cases it could work fine but with 'Office Address: '
and 'Address: '
existing value for 'Office Address:'
goes into 'Address:'
. Besides, the code looks messy and ugly. Also seems like kind of brute forcing.
I guess there should be a better way. For example using regular expression or something similar. but no external library.
If you have any idea I will appreciate it for sharing.
Upvotes: 0
Views: 91
Reputation: 36511
Assuming the properties are separated by newline characters, you create an object mapping each attribute to its value using:
var str = "Name: David Jones\nOffice Address: 148 Hulala Street\nDate: 24/11/2013\nAgent No: 1234,\nAddress: 259 Yolo Road\\nType: Human Properties: None Age: 29";
var output = {};
str.split(/\n/).forEach(function(item){
var match = (item.match(/([A-Za-z\s]*):\s([A-Za-z0-9\s\/]*)/));
output[match[1]] = match[2];
});
console.log(output)
Upvotes: 1
Reputation: 509
This may help:
> a.substr(a.indexOf("Name"), a.indexOf("Office Address")).split(":")
["Name", " David Jones "]
Upvotes: 1