ray an

Reputation: 1288

How to write a regular expression when the data does not follow a single format?

I have a JavaScript string:

let entries = `23-05-1990  Some heading
               27-05-1990  Liar Liar
               29-05-1990  Another Heading
               30-05-1990  50/50
               31-05-1990  My day`

Using regex I need to process this string and generate two arrays:

// 1) date array:
date = ["23-05-1990","27-05-1990", "29-05-1990", "30-05-1990", "31-05-1990"]

// 2) headings array
headings = ["Some heading", "Liar Liar" ,"Another Heading",  "50/50", "My day"]

So far this is simple: Split by line break and then pass each individual date-heading to a regex. Get the date and the heading and append them to their respective arrays.
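
For the uniform layout above, that split-and-match step might look like the minimal sketch below. The function name `splitSimple` is hypothetical, and the pattern assumes the date always comes first and is followed by at least one whitespace character.

```javascript
// Sketch for the uniform "date  heading" layout only.
// `splitSimple` is a hypothetical helper name, not part of the original post.
function splitSimple(text) {
  const dates = [];
  const headings = [];
  for (const line of text.split(/\r?\n/)) {
    // Trim the indentation, then expect: date, whitespace, heading.
    const m = line.trim().match(/^(\d{2}-\d{2}-\d{4})\s+(.+)$/);
    if (m) {
      dates.push(m[1]);
      headings.push(m[2]);
    }
  }
  return { dates, headings };
}

console.log(splitSimple(`23-05-1990  Some heading
               27-05-1990  Liar Liar`));
```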

But the issue is I don't have a consistent format for the data.

Some of the data is in this format, i.e. the heading comes before the date:

    `Liar Liar          27-05-1990  
     Another Heading    29-05-1990  
     50/50              30-05-1990  
     My day             31-05-1990  `

There may also be a separator between the heading and the date:

   `23-05-1990 : Some heading
    27-05-1990 : Yes Man`

   `29-05-1990: Another Heading`

   `30-05-1990 - 50/50
    31-05-1990 - My day`

So the date and the heading are always there (we don't know which one comes first), but the separator may or may not be present.

Also,

  1. The separator is one of the three listed below:

    " " (space), "-" , ":"

  2. The heading can't start or end with any character other than a letter or a digit.

Upvotes: 2

Views: 81

Answers (2)

Cary Swoveland

Reputation: 110725

You could match the following regular expression. The date string will be in capture group 1 or 4 and the other will be empty. The heading will be in capture group 2 or 3 and the other will be empty.

^(?:(\d{2}-\d{2}-\d{4}) *[-:]? *([A-Z\d].*)|([A-Z\d].*)(?<![ :-]) *[-:]? *(\d{2}-\d{2}-\d{4}))$

Start your engine!

As seen at the link, "$1$4" returns the date string and "$2$3" returns the heading.

JavaScript's regex engine performs the following operations.

^                      : assert beginning of string
(?:                    : begin non-capture group
  (\d{2}-\d{2}-\d{4})  : match date and save to capture group 1
  [ ]*[-:]?[ ]*        : match 0+ spaces, optional '-' or ':',
                         0+ spaces 
  ([A-Z\d].*)          : match heading and save to capture group 2
|                      : or
  ([A-Z\d].*)          : match heading and save to capture group 3
  (?<![ :-])           : negative lookbehind asserts previous
                         character is neither ' ', ':' nor '-'
  [ ]*[-:]?[ ]*        : match 0+ spaces, optional '-' or ':',
                         0+ spaces
  (\d{2}-\d{2}-\d{4})  : match date and save to capture group 4 
)                      : end non-capture group
$                      : assert end of string
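
To get from the capture groups to the two arrays asked for, one option is to apply the pattern line by line. The sketch below is only an illustration: `LINE_RE` and `parseEntries` are hypothetical names, and each line is trimmed first so that leading indentation and trailing spaces do not defeat the `^` and `$` anchors.

```javascript
// The pattern from this answer, unchanged.
// Groups 1 and 2: date-first branch. Groups 3 and 4: heading-first branch.
const LINE_RE = /^(?:(\d{2}-\d{2}-\d{4}) *[-:]? *([A-Z\d].*)|([A-Z\d].*)(?<![ :-]) *[-:]? *(\d{2}-\d{2}-\d{4}))$/;

// Hypothetical helper: split on line breaks and collect dates and headings.
function parseEntries(text) {
  const dates = [];
  const headings = [];
  for (const raw of text.split(/\r?\n/)) {
    const m = raw.trim().match(LINE_RE);
    if (!m) continue;              // skip lines that fit neither layout
    dates.push(m[1] ?? m[4]);      // whichever branch matched holds the date
    headings.push(m[2] ?? m[3]);   // likewise for the heading
  }
  return { dates, headings };
}

const sample = `23-05-1990  Some heading
     Liar Liar          27-05-1990  
29-05-1990: Another Heading
30-05-1990 - 50/50`;
console.log(parseEntries(sample));
```

Note that `[A-Z\d]` only accepts an uppercase letter or a digit as the first character of the heading. If headings may start with a lowercase letter, adding the `i` flag to the pattern should cover that.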

Upvotes: 2

mike.k

Reputation: 3457

This works, but it doesn't account for duplicates. If that is a problem, you can filter them out afterwards, or use key/value pairs instead of an array.

Part of the while loop was from regex101.com

const regexes = [
    /((?<date>\d{2}-\d{2}-\d{4})[ :\-]+(?<title>.*)[\r\n])/gm,
    /(?<title>.*)[ :\-]+((?<date>\d{2}-\d{2}-\d{4})[\r\n])/gm
];
const str = `23-05-1990  Some heading
27-05-1990  Liar Liar
29-05-1990  Another Heading
30-05-1990  50/50
31-05-1990  My day
Liar Liar          27-05-1990
Another Heading    29-05-1990
50/50              30-05-1990
My day             31-05-1990
23-05-1990 : Some heading
27-05-1990 : Yes Man
29-05-1990: Another Heading
30-05-1990 - 50/50
31-05-1990 - My day`;

let output = [];

regexes.forEach(regex => {
    let m;
    while ((m = regex.exec(str)) !== null) {
        // This is necessary to avoid infinite loops with zero-width matches
        if (m.index === regex.lastIndex) {
            regex.lastIndex++;
        }

        output.push([m.groups.date.trim(), m.groups.title.trim()]);
    }
    
});

console.log(output);

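Output is (the final input line, `31-05-1990 - My day`, is missing because both patterns require a line break after each entry and the string does not end with one):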

[
  [ '23-05-1990', 'Some heading' ],
  [ '27-05-1990', 'Liar Liar' ],
  [ '29-05-1990', 'Another Heading' ],
  [ '30-05-1990', '50/50' ],
  [ '31-05-1990', 'My day' ],
  [ '23-05-1990', 'Some heading' ],
  [ '27-05-1990', 'Yes Man' ],
  [ '29-05-1990', 'Another Heading' ],
  [ '30-05-1990', '50/50' ],
  [ '27-05-1990', 'Liar Liar' ],
  [ '29-05-1990', 'Another Heading' ],
  [ '30-05-1990', '50/50' ],
  [ '31-05-1990', 'My day' ]
]
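
If the duplicates matter, one way to filter them out afterwards is sketched below. `dedupePairs` is a hypothetical helper, not part of the code above. It keys a `Map` on the date and title together, since keying on the date alone would collapse distinct entries that share a date (for example 'Liar Liar' and 'Yes Man' on 27-05-1990).

```javascript
// Hypothetical helper: drop repeated [date, title] pairs, keeping the first occurrence.
function dedupePairs(pairs) {
  const seen = new Map(); // a Map iterates in insertion order
  for (const [date, title] of pairs) {
    const key = `${date}|${title}`; // composite key: date plus title
    if (!seen.has(key)) {
      seen.set(key, [date, title]);
    }
  }
  return [...seen.values()];
}

console.log(dedupePairs([
  ['27-05-1990', 'Liar Liar'],
  ['27-05-1990', 'Yes Man'],
  ['27-05-1990', 'Liar Liar'],
]));
```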

Upvotes: 1
