so_mc

Reputation: 169

JavaScript RegEx How to extract substring dynamically

I need to extract a substring from dynamic input. I already get the output I need, but only by hard-coding every item, so the code is neither dynamic nor reliable. Is there another way to extract the item description ("B1003 = Engineering Business Card") and the quantity ("2")? Both are dynamic: an entirely different item could appear, such as "O1003 = Pencil" or "O1004 = Sticky Notes". Can this be written as a regex that makes the code more reliable?

The input is text extracted with Tesseract OCR. I need to pull out the required information and pass it to another service.

var requisition = `Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 B1003 = Engineering Business Card Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
`;

//rule 1 - Gets all Items + Quantity
//rule 2 - Gets all Items
//rule 3 - Gets all Quantity
//resultArray - Contains Quantity + Item e.g. 2 B1003 Engineering Business Cards

var rule1 = /(B1002 = Accountant Business Card|B1003 = Engineering Business Card|B1001 = Sales and Marketing Business Card|O1001 = Black Ballpen Branded Panda Regular with Eraser|O1002 = Notebook|O1003 = Pencil|O1004 = Stick Notes) (.*) ([0-9]|[0-9][0-9]|[0-9][0-9][0-9])/
var rule2 = /(B1002 = Accountant Business Card|B1003 = Engineering Business Card|B1001 = Sales and Marketing Business Card|O1001 = Black Ballpen Branded Panda Regular with Eraser|O1002 = Notebook|O1003 = Pencil|O1004 = Stick Notes)/
var rule3 = /([0-9]|[0-9][0-9]|[0-9][0-9][0-9])/

var resultarray = []

var stringarray = requisition.split("\n")
stringarray.forEach(element => {
    var result = element.match(rule1)
    if (result!=null){
        var itemName = result[0].match(rule2)
        var quantity = result[0].match(rule3)
        resultarray.push (quantity[0]+ " " + itemName[0])
    }
});

console.log (resultarray.join(", "))

Note: To make things clearer, this is the image I extracted the text from. Legend: blue = static; unboxed = dynamic; yellow = text that needs to be extracted (also dynamic).

- This is the image extracted, first line is static, second line is dynamic

The expected result is "2 B1003 = Engineering Business Card" (followed by further entries such as ", B1002 = Accountant Business Card" if the input contains other matching items). Please check the comments in the requisition variable.

Again, I can already get the desired output. I just need to know how the code can be written differently so that it is more dynamic and reliable using RegEx. Please bear with me, as I don't know much about RegEx. Thanks!

Upvotes: 1

Views: 317

Answers (2)

Vincent

Reputation: 4803

Short answer:

var requisition = `Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 B1003 = Engineering Business Card 2 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
`;

//rule 1 - Gets all Items + Quantity
//rule 2 - Gets all Items
//rule 3 - Gets all Quantity
//resultArray - Contains Quantity + Item e.g. 2 B1003 Engineering Business Cards

var rule1 = /(B1002 = Accountant Business Card|B1003 = Engineering Business Card|B1001 = Sales and Marketing Business Card|O1001 = Black Ballpen Branded Panda Regular with Eraser|O1002 = Notebook|O1003 = Pencil|O1004 = Stick Notes)[^\d]+(\d+) .*/

var resultarray = []

var stringarray = requisition.split("\n")
stringarray.forEach(element => {
    var result = element.match(rule1)
    if (result!=null){
        var itemName = result[1]
        var quantity = result[2]
        resultarray.push (quantity + " " + itemName)
    }
});

console.log (resultarray.join(", "))

Output:

2 B1003 = Engineering Business Card, 5 O1003 = Pencil

Long answer:

There are many things to fix:

  1. Use only rule 1 (after some modifications) to match everything (item name and quantity) using (\d+)
  2. Get rid of rule 2 and 3
  3. Use result[1] as item name and result[2] as quantity

Please note that your fields are separated by spaces, yet the fields themselves can contain spaces, so your data is not really structured. It would be much more reliable if you had, for instance, a tab-delimited file. The rule I used to find the quantity is "skip everything after the product name until the first number", so if you ever have a category that contains a digit, the match will break, and there is nothing you can do about it without a structured file.
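To avoid listing every item name, one option is to anchor on the parts of the line that do have a predictable shape. The following is only a sketch under assumptions that are not confirmed by the question: item codes are one uppercase letter followed by four digits, the list of categories is known in advance (and shorter than the list of items), and each data line follows the layout shown in the question (line number, item, category, quantity, UOM, price ending in USD). The names categories, escapeRegex, linePattern, and extractItems are made up for illustration.

```javascript
// Assumption: item codes look like one uppercase letter plus four digits (e.g. "B1003").
// Assumption: the categories are known up front; only item names vary freely.
const categories = ["Business Cards", "Office Supplies"];

// Escape regex metacharacters in case a category name ever contains any.
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

const linePattern = new RegExp(
  "^\\d+\\s+" +                                             // line number
  "([A-Z]\\d{4} = .+?)\\s+" +                               // item code and description (lazy)
  "(?:" + categories.map(escapeRegex).join("|") + ")\\s+" + // category name
  "(\\d+)\\s+\\w+\\s+" +                                    // quantity, then UOM
  "[\\d.]+USD"                                              // price, e.g. 50.00USD
);

// Returns one "quantity item" string per matching line.
function extractItems(text) {
  const results = [];
  for (const line of text.split("\n")) {
    const match = line.match(linePattern);
    if (match) {
      const [, item, quantity] = match;
      results.push(quantity + " " + item);
    }
  }
  return results;
}

const sample = `Lines
1 B1003 = Engineering Business Card Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved
1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved`;

console.log(extractItems(sample).join(", "));
// 2 B1003 = Engineering Business Card, 5 O1003 = Pencil
```

The lazy .+? lets the description grow only until a known category follows, so "Engineering Business Card" is not cut short at the word "Business". This still depends on the OCR output keeping the column order intact; a line without a recognized category is skipped rather than guessed at.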

Upvotes: 2

CertainPerformance

Reputation: 371148

You can put it all into a single regular expression, and capture the quantity in one group, and the itemName in the other. Then extract those groups from the match (if there's a match):

var requisition = `Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 B1003 = Engineering Business Card Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
`;

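// The lazy .*? stops at the first number after the item name (the quantity);
// a greedy .* would run on to the last digit on the line instead.
var rule = /(B1002 = Accountant Business Card|B1003 = Engineering Business Card|B1001 = Sales and Marketing Business Card|O1001 = Black Ballpen Branded Panda Regular with Eraser|O1002 = Notebook|O1003 = Pencil|O1004 = Stick Notes).*?(\d{1,3})/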

var resultarray = []

var stringarray = requisition.split("\n")
stringarray.forEach(element => {
  const match = element.match(rule);
  if (match) {
    const [, itemName, quantity] = match;
    resultarray.push(quantity + ' ' + itemName);
  }
});

console.log(resultarray)

For a more minimal example:

const input = `Lines
foo 1
bar 2
baz don't match`;
const pattern = /(foo|bar) (\d+)/;
const output = [];
input
  .split('\n')
  .forEach((line) => {  
    const match = line.match(pattern);
    if (match) {
      const [, itemName, quantity] = match;
      output.push(quantity + ' ' + itemName);
    }
  });
console.log(output);
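As a variation on the minimal example above, the split-and-loop step can be replaced by a single pass over the whole text with String.prototype.matchAll and named capture groups. This is just a sketch reusing the same toy input; the names text, linePattern, and pairs are illustrative.

```javascript
// Same toy input as the minimal example above.
const text = `Lines
foo 1
bar 2
baz don't match`;

// "g" finds every match; "m" makes ^ anchor at the start of each line.
const linePattern = /^(?<itemName>foo|bar) (?<quantity>\d+)/gm;

// matchAll yields one match object per hit; named groups are exposed on .groups.
const pairs = Array.from(
  text.matchAll(linePattern),
  (m) => `${m.groups.quantity} ${m.groups.itemName}`
);

console.log(pairs); // [ '1 foo', '2 bar' ]
```

Named groups keep the destructuring from depending on group order, which helps once the pattern gains more groups. matchAll requires the g flag and a reasonably recent JavaScript runtime.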

Upvotes: 0
