pros89

Reputation: 13

Perl: perl regex for extracting values from complex lines

Input log file:

 Nservdrx_cycle 4       servdrx4_cycle
 HCS_cellinfo_st[10]     (type = (LTE { 2}),cell_param_id = (28)
 freq_info =  (10560),band_ind = (rsrp_rsrq{ -1}),Qoffset1 = (0)
 Pcompensation = (0),Qrxlevmin = (-20),cell_id = (7), 
 agcreserved{3} = ({ 0, 0, 0 }))    
 channelisation_code1   16/5 { 4}   channelisation_code1
 sync_ul_info_st_   (availiable_sync_ul_code = (15),uppch_desired_power = 
 (20),power_ramping_step = (3),max_sync_ul_trans = (8),uppch_position_info =
 (0))
 trch_type  PCH { 7}    trch_type8      
 last_report    0   zeroth bit

I am trying to extract only the integer values from the input above, but I run into a problem when a name contains an integer at its beginning or end.

For example, in agcreserved{3}, HCS_cellinfo_st[10] and Qoffset1, the {3}, [10] and 1 belong to the name and should not be extracted, but my code picks them up because it extracts every integer.

Here is the simple regex I wrote to extract integers.

MY SIMPLE CODE:

 use strict;
 use warnings;
 my $Ipfile  = 'data.txt';
 open my $FILE, "<", $Ipfile or die "Couldn't open input file: $!";
 my @array;
 while (<$FILE>)
 {
  # collect every (optionally signed) run of digits on the line
  while ($_ =~ m/( [+-]?\d+ )/xg)
  {
   push @array, ($1);
  }
 }
 print "@array \n";

Output I am getting for the above input:

4 4 10 2 28 10560 -1 1 0 0 -20 7 3 0 0 0 1 16 5 4 1 15 20 3 8 0 7 8 0

Expected output:

4 2 28 10560 -1 0 0 -20 7 0 0 0 4 15 20 3 8 0 7 0

Can somebody help me with this, with an explanation?

Upvotes: 1

Views: 63

Answers (1)

ardavey

Reputation: 161

You are catching every integer because your regex places no restrictions on which characters can (or cannot) come before or after the integer. Remember that the /x modifier only allows whitespace and comments inside your pattern for readability, so the spaces around [+-]?\d+ are ignored rather than matched against the input.
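
To illustrate the point about /x, here is a minimal sketch. The sample string is a fragment of the question's log, and the variable names are only illustrative. Both patterns return the same matches, including the digits embedded in names:

```perl
use strict;
use warnings;

# Fragment of the question's log; the 4 and 10 are embedded in names.
my $line = 'servdrx4_cycle HCS_cellinfo_st[10]';

# Under /x the spaces inside the pattern are ignored ...
my @with_x    = $line =~ m/( [+-]?\d+ )/xg;
# ... so this is the same pattern written without them.
my @without_x = $line =~ m/([+-]?\d+)/g;

print "@with_x\n";      # 4 10
print "@without_x\n";   # 4 10
```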

Without knowing a bit more about the possible structure of your input data, this modification achieves the desired output:

  while ( $_ =~ m! [^[{/\w] ( [+-]?\d+ ) [^/\w]!xg ) {
    push @array, ($1);
  }

I have added rules before and after the integer to exclude certain characters. So now, we will only capture if:

  • There is no [, {, /, or word character immediately before the number
  • There is no / or word character immediately after the number
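
As a quick check of those two rules, the following sketch applies the same pattern to a few lines copied from the question's log. The lines are inlined in an array only to keep the example self-contained, and @lines and @found are illustrative names:

```perl
use strict;
use warnings;

# A few lines copied from the question's log file.
my @lines = (
    " Nservdrx_cycle 4       servdrx4_cycle\n",
    " HCS_cellinfo_st[10]     (type = (LTE { 2}),cell_param_id = (28)\n",
    " agcreserved{3} = ({ 0, 0, 0 }))\n",
    " channelisation_code1   16/5 { 4}   channelisation_code1\n",
);

my @found;
for my $line (@lines) {
    # One character that is not [, {, / or a word character, then the
    # captured integer, then one character that is not / or a word character.
    while ( $line =~ m! [^[{/\w] ( [+-]?\d+ ) [^/\w] !xg ) {
        push @found, $1;
    }
}

print "@found\n";    # 4 2 28 0 0 0 4
```

The numbers attached to names (servdrx4, [10], {3}, code1) and the 16/5 pair are skipped, while the standalone values are kept.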

If your data could have 2-digit numbers in the { N} blocks written without the leading space (e.g. PCH {12}), then this will not capture them, because the digits would directly follow the {, and the pattern will need to become much more complex. This solution is therefore quite brittle, without knowing more of the rules about your target data.
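
Another source of brittleness is that the pattern consumes the character on each side of the number, so two numbers separated by a single character cannot both match. One possible variation, offered only as a sketch and not as part of the pattern above, is to check the neighbouring characters with lookarounds instead. The input string here is hypothetical, and whether the changed behaviour is wanted depends on rules about the data that the question does not give. This does not address the {12} case.

```perl
use strict;
use warnings;

# Hypothetical input: integers separated by a single comma.
my $line = 'vals = (1,2,3) code1 16/5';

# Consuming version: the "," after 1 is used up, so 2 has no
# character left to serve as its "before" character.
my @consuming = $line =~ m/ [^\[{\/\w] ( [+-]?\d+ ) [^\/\w] /xg;

# Lookaround version: neighbours are inspected but not consumed.
my @lookaround = $line =~ m/ (?<! [\[{\/\w] ) ( [+-]?\d+ ) (?! [\/\w] ) /xg;

print "@consuming\n";     # 1 3
print "@lookaround\n";    # 1 2 3
```

Note that the lookaround form also matches a number at the very start of a line, which the consuming form cannot, since it requires a character before the number.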

Upvotes: 2
