Eugene Yu

Reputation: 3958

How to parse and format the content of a text into an object

As the title says, I need to extract the values of certain fields from a long text.

I have the following text:

Name: David Jones
Office Address: 148 Hulala Street Date: 24/11/2013
Agent No: 1234,
Address: 259 Yolo Road Start Date: 22/11/2013 Due Date: 29/11/2013
Type: Human Properties: None Ago: 29

And I have these labels for the fields I want to extract:

Name, Office Address, Date, Agent No, Address, Type, Properties, Age

And the result I want to get is:

Name: 'David Jones',
Office Address: '148 Hulala Street',
Date: '24/11/2013',
Agent No: '1234',
Address: '259 Yolo Road',
Type: 'Human',
Properties: 'None',
Age: ''

That is, an object holding the parsed value of each field. Note that the original text may contain typos (e.g., Ago instead of Age) and extra fields that are not in the list of labels (e.g., Start Date and Due Date). The code should ignore any non-matching text and extract only the matching fields.

I tried to solve this by looping over each line, checking whether the line contains a field, and then checking whether it also contains more fields.

Currently I have the following code:

var structure = ['Name','Office Address','Date','Agent No','Address','Type','Properties','Age'];
var obj = {};
var i, j, matchingFields;
// textLines is assumed to be the input text split into an array of lines
for (i = 0; i < textLines.length; i++) {
  matchingFields = [];
  // collect the labels found on this line, in order of appearance
  for (j = 0; j < structure.length; j++) {
    if (textLines[i].indexOf(structure[j] + ':') !== -1) {
      if (matchingFields.length === 0 && textLines[i].indexOf(structure[j] + ':') === 0) {
        matchingFields.push(structure[j]);
        structure.splice(structure.indexOf(structure[j--]), 1);
      } else if (textLines[i].indexOf(structure[j] + ':') > textLines[i].indexOf(matchingFields[matchingFields.length-1])) {
        matchingFields.push(structure[j]);
        structure.splice(structure.indexOf(structure[j--]), 1);
      }
    }
  }

  // slice out the value that follows each matched label
  for (j = 0; j < matchingFields.length; j++) {
    if (j !== matchingFields.length-1) {
      obj[matchingFields[j]] = textLines[i].slice(textLines[i].indexOf(matchingFields[j]) + matchingFields[j].length, textLines[i].indexOf(matchingFields[j+1]));
    } else {
      obj[matchingFields[j]] = textLines[i].slice(textLines[i].indexOf(matchingFields[j]) + matchingFields[j].length);
    }

    // strip the colon and one leading/trailing space from the value
    obj[matchingFields[j]] = obj[matchingFields[j]].replace(':', '');
    if (obj[matchingFields[j]].indexOf(' ') === 0) {
      obj[matchingFields[j]] = obj[matchingFields[j]].replace(' ', '');
    }
    if (obj[matchingFields[j]].charAt(obj[matchingFields[j]].length-1) === ' ') {
      obj[matchingFields[j]] = obj[matchingFields[j]].slice(0, obj[matchingFields[j]].length-1);
    }
  }
}

This works in some cases, but when both 'Office Address: ' and 'Address: ' are present, the value for 'Office Address:' ends up in 'Address:'. Besides, the code looks messy and feels like brute force.

I guess there should be a better way, for example using a regular expression or something similar, but without any external library.

If you have any ideas, I would appreciate it if you shared them.
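
The collision can be reproduced in isolation. A minimal sketch, using one line from the sample text: a plain indexOf finds 'Address:' inside 'Office Address:', so both labels appear to match the same line.

```javascript
// One line from the sample text above.
var line = 'Office Address: 148 Hulala Street Date: 24/11/2013';

// 'Address:' is a substring of 'Office Address:', so indexOf also finds it here.
var officeIndex = line.indexOf('Office Address:');
var addressIndex = line.indexOf('Address:');

console.log(officeIndex);  // 0
console.log(addressIndex); // 7
```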

Upvotes: 0

Views: 91

Answers (2)

Rob M.

Reputation: 36511

Assuming the properties are separated by newline characters, you can create an object mapping each attribute to its value using:

var str = "Name: David Jones\nOffice Address: 148 Hulala Street\nDate: 24/11/2013\nAgent No: 1234,\nAddress: 259 Yolo Road\nType: Human\nProperties: None\nAge: 29";
var output = {};

// Each line is expected to look like "Label: value".
str.split(/\n/).forEach(function(item){
    var match = item.match(/([A-Za-z\s]*):\s([A-Za-z0-9\s\/]*)/);
    if (match) { // skip lines that do not match instead of throwing on null
        output[match[1]] = match[2];
    }
});

console.log(output);
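
The snippet above expects one field per line. For lines that hold several fields, as in the question, one possible sketch is to build a single regular expression from the label list. Everything here is an assumption rather than a definitive solution: parseFields and escapeRegExp are hypothetical helper names, a label is only recognized at the start of a line or after whitespace, any single word followed by a colon (such as the typo 'Ago:') is treated as an unknown label and dropped, the first occurrence of a label wins, and a trailing comma on a value is stripped. The optional third argument lists labels that may appear in the text but should be discarded.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper (not part of any library).
// labels:  field names to extract
// ignored: optional labels that occur in the text but should be dropped
function parseFields(text, labels, ignored) {
  var result = {};
  labels.forEach(function (label) { result[label] = ''; });

  // Longest labels first, so 'Office Address' is tried before 'Address'.
  var known = labels.concat(ignored || []).sort(function (a, b) {
    return b.length - a.length;
  });
  // '[A-Za-z]+' catches unknown single-word labels such as 'Ago'.
  var pattern = new RegExp(
    '(^|\\s)(' + known.map(escapeRegExp).join('|') + '|[A-Za-z]+):', 'g');

  text.split('\n').forEach(function (line) {
    var hits = [];
    var m;
    pattern.lastIndex = 0;
    while ((m = pattern.exec(line)) !== null) {
      hits.push({ label: m[2], labelStart: m.index, valueStart: m.index + m[0].length });
    }
    hits.forEach(function (hit, i) {
      // A value runs up to the next label on the line, or to the end of the line.
      var end = i + 1 < hits.length ? hits[i + 1].labelStart : line.length;
      var value = line.slice(hit.valueStart, end).trim().replace(/,$/, '');
      // Keep only requested labels, and only their first occurrence.
      if (labels.indexOf(hit.label) !== -1 && result[hit.label] === '') {
        result[hit.label] = value;
      }
    });
  });
  return result;
}

var sampleText = [
  'Name: David Jones',
  'Office Address: 148 Hulala Street Date: 24/11/2013',
  'Agent No: 1234,',
  'Address: 259 Yolo Road Start Date: 22/11/2013 Due Date: 29/11/2013',
  'Type: Human Properties: None Ago: 29'
].join('\n');
var labels = ['Name', 'Office Address', 'Date', 'Agent No', 'Address', 'Type', 'Properties', 'Age'];

var parsed = parseFields(sampleText, labels, ['Start Date', 'Due Date']);
console.log(parsed);
```

One limitation to be aware of: without the ignored list, nothing in the text separates a value from an unknown multi-word label, so Address would come out as '259 Yolo Road Start'.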

Upvotes: 1

nius

Reputation: 509

This may help (here a holds the input text):

> a.substring(a.indexOf("Name"), a.indexOf("Office Address")).split(":")
["Name", " David Jones "]
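
The same idea can be wrapped in a small helper. This is only a sketch: valueBetween is a hypothetical name, and it assumes each label is followed by a colon and that the end label appears after the start label. It uses slice, whose second argument is an end index.

```javascript
// Hypothetical helper: returns the text after `startLabel:` up to `endLabel`,
// or '' if either label cannot be found.
function valueBetween(text, startLabel, endLabel) {
  var start = text.indexOf(startLabel + ':');
  if (start === -1) { return ''; }
  start += startLabel.length + 1; // skip the label and its colon
  var end = text.indexOf(endLabel, start);
  if (end === -1) { return ''; }
  return text.slice(start, end).trim();
}

var a = 'Name: David Jones\nOffice Address: 148 Hulala Street Date: 24/11/2013';
console.log(valueBetween(a, 'Name', 'Office Address')); // David Jones
console.log(valueBetween(a, 'Office Address', 'Date')); // 148 Hulala Street
```

Because it relies on indexOf, it shares the limitation from the question: looking up 'Address' would also find the 'Address:' inside 'Office Address:'.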

Upvotes: 1
