Nick Price

Reputation: 963

Parse multilingual flight info logs and extract potentially space-separated flight numbers

I have data like the following:

<terminal:Text>1  #VS   5 J9 C9 D9 I9 Z9 W9 S9 H9 LHRMIA 1235 1705      744 0E</terminal:Text>
<terminal:Text>        K9 Y9 B9 R9 L9 U9 M9 E9 Q9 X9 N9 O9 </terminal:Text>
<terminal:Text>2  #IB4637 F9 A9 J9 C9 D9 R9 I. W9 LHRMIA 1415 1825   *  744 0E</terminal:Text>
<terminal:Text>        Z. Y9 B9 H9 K. M. L. V. S. N. Q. O.</terminal:Text>
<terminal:Text>3*O#AA  57 F7 A7 P7 J7 R7 D7 I7 Y7 LHRMIA 0945 1415      777 0E</terminal:Text>
<terminal:Text>        B7 H7 K7 M7 L7 V7 G7 S7 Q7 N7 O7 </terminal:Text>

I am trying to work out the best way to split this data into the fields I need. To start with, I do the following:

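$elNum = 0;

while ($elNum < $elements->length) 
{
    $flightInfo = $elements->item($elNum)->nodeValue;

    // Main rows start with a digit; continuation rows do not.
    if ( preg_match('/^\\d/', $flightInfo ) === 1 )
    {
        // ... process the main row here
    }

    // Always advance, otherwise a continuation row would loop forever.
    ++$elNum;
}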

$elements holds the Text elements I am iterating over. Here is what I know: a main row always starts with a digit, which is why I use preg_match(). The row that follows a main row belongs to it, so in the example above there are two rows for each flight.

I was thinking about exploding each row on spaces, but I will probably leave that until I extract the seats (J9, M., I7, etc.). To start with, I need the flight numbers.

A flight number always starts with a #. The airline code is always 2 uppercase letters, and the numeric part can be 1-4 digits. So with the above, I could do something like:

$pat = strpos($flightInfo, "#");

That gets me to the start of each flight number. Here is the tricky part: the flight numbers are not formatted consistently, as the example above shows. The first one is VS, then 3 spaces, then 5 (so VS5). The second one is straightforward; it is all together (IB4637). The last one is AA, then 2 spaces, then 57 (AA57). Sometimes there is only one space.

The airline code is always attached to the # and is always 2 characters long, so to get it I could do something like:

$fltcode = substr($flightInfo, $pat+1, 2);

My main question is: how can I handle the numeric part when it can be 1-4 digits long and can either be attached directly to the airline code or separated from it by one or more spaces?

Upvotes: 1

Views: 58

Answers (2)

mickmackusa

Reputation: 47874

If you are scanning lines of a file, fscanf() may be a good choice. For the sake of online demonstration, I'll show a script making iterated sscanf() calls.

Match, but don't capture, the characters before #; then match the letters along with any trailing spaces, then the digits. If all 3 placeholders match on a line, push the right-trimmed letters concatenated with the digits into the result array.

Code:

$log = <<<LOG
<terminal:Text>1  #VS   5 J9 C9 D9 I9 Z9 W9 S9 H9 LHRMIA 1235 1705      744 0E</terminal:Text>
<terminal:Text>        K9 Y9 B9 R9 L9 U9 M9 E9 Q9 X9 N9 O9 </terminal:Text>
<terminal:Text>2  #IB4637 F9 A9 J9 C9 D9 R9 I. W9 LHRMIA 1415 1825   *  744 0E</terminal:Text>
<terminal:Text>        Z. Y9 B9 H9 K. M. L. V. S. N. Q. O.</terminal:Text>
<terminal:Text>3*O#AA  57 F7 A7 P7 J7 R7 D7 I7 Y7 LHRMIA 0945 1415      777 0E</terminal:Text>
<terminal:Text>        B7 H7 K7 M7 L7 V7 G7 S7 Q7 N7 O7 </terminal:Text>
LOG;

$result = [];
foreach (explode("\n", $log) as $line) {
    if (sscanf($line, '<terminal:Text>%*[^#]#%[^0-9]%d', $letters, $digits) === 3) {
        $result[] = rtrim($letters) . $digits;
    }
}
var_export($result);

Output:

array (
  0 => 'VS5',
  1 => 'IB4637',
  2 => 'AA57',
)
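
For comparison, the strpos()/substr() idea from the question can also be completed without scanf or regex. The sketch below is only an illustration: flightNumberFromRow() is a hypothetical helper name, and it assumes the airline code is exactly the two characters after # and that only spaces separate it from the digits.

```php
<?php
// Hypothetical helper (not part of the original answer).
// Returns e.g. "VS5" for a main row, or null when the row has no flight number.
function flightNumberFromRow(string $row): ?string
{
    $hash = strpos($row, '#');
    if ($hash === false) {
        return null; // continuation rows contain no '#'
    }

    // Assumption: the airline code is the two characters right after '#'.
    $code = substr($row, $hash + 1, 2);

    // Skip any spaces between the airline code and the digits.
    $rest = ltrim((string) substr($row, $hash + 3), ' ');

    // Count the leading digits, capped at the 4 allowed by the question.
    $len = min(strspn($rest, '0123456789'), 4);
    if ($len === 0) {
        return null;
    }

    return $code . substr($rest, 0, $len);
}

var_export(flightNumberFromRow('1  #VS   5 J9 C9 D9 I9 Z9 W9 S9 H9 LHRMIA 1235 1705      744 0E'));
```

ltrim() absorbs however many spaces follow the airline code, and strspn() measures the run of digits, so no fixed offset for the number is needed.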

Upvotes: 1

Wiktor Stribiżew

Reputation: 626738

If you want a regex solution, you can try this pattern:

\d+[^#]*\#(\p{Lu}{2})\s*(\d{1,4})\b

or

(?<=<terminal:Text>)\d+[^#]*\#(\p{Lu}{2})\s*(\d{1,4})\b

Use the second form if the <terminal:Text> opening tag immediately precedes the row text in your input.

Basically, it captures the flight number in 2 groups, the 2 uppercase letters and the 1 to 4 digits, which you then need to concatenate.

Output:

MATCH 1
1.  [4-6]   `VS`
2.  [9-10]  `5`
MATCH 2
1.  [113-115]   `IB`
2.  [115-119]   `4637`
MATCH 3
1.  [221-223]   `AA`
2.  [225-227]   `57`
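
As a minimal sketch of how the first pattern might be applied in PHP, the two groups can be joined with preg_match_all(). The helper name extractFlightNumbers() and the $text variable are hypothetical, and the sample assumes the rows have already been pulled out of the Text elements into one newline-separated string.

```php
<?php
// Hypothetical helper: collects flight numbers such as "VS5" from raw row text.
function extractFlightNumbers(string $text): array
{
    // Group 1: two uppercase letters right after '#'.
    // Group 2: 1-4 digits, optionally preceded by whitespace.
    $pattern = '/\d+[^#]*#(\p{Lu}{2})\s*(\d{1,4})\b/u';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $flights = [];
    foreach ($matches as $m) {
        // Concatenate airline code and number, dropping the spaces between them.
        $flights[] = $m[1] . $m[2];
    }
    return $flights;
}

// Sample rows copied from the question (assumed to be plain node values).
$text = <<<TXT
1  #VS   5 J9 C9 D9 I9 Z9 W9 S9 H9 LHRMIA 1235 1705      744 0E
        K9 Y9 B9 R9 L9 U9 M9 E9 Q9 X9 N9 O9
2  #IB4637 F9 A9 J9 C9 D9 R9 I. W9 LHRMIA 1415 1825   *  744 0E
        Z. Y9 B9 H9 K. M. L. V. S. N. Q. O.
3*O#AA  57 F7 A7 P7 J7 R7 D7 I7 Y7 LHRMIA 0945 1415      777 0E
        B7 H7 K7 M7 L7 V7 G7 S7 Q7 N7 O7
TXT;

var_export(extractFlightNumbers($text));
```

PREG_SET_ORDER keeps each match's groups together, which makes the concatenation a one-liner per flight.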

Upvotes: 1
