user2708928

Reputation: 31

Perl Regex Match Text String and Extract Following Number

I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).

The header info that I want is of the format ex: BGA 1

What I want for that extra data column is the number after the word BGA. It will be a number between 1 and maybe 20000. I can write the regex to match the word BGA, but I can't figure out how to get just the digits after it.

To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.

Here's what I have so far, which actually doesn't work... I want it to at least print "BGA" every time it encounters the word BGA, but it prints nothing... Any help would be appreciated.

#!/usr/bin/perl
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
$_="";

while(my $line = <$info>){

    if ($line eq "/BGA/"){
    print <>,"\n";
        }
}
close $file;

Upvotes: 3

Views: 4637

Answers (3)

Sinan Ünür

Reputation: 118128

First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory.
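As a minimal sketch of what slurping could look like (the helper names slurp and bga_numbers, and the sample text, are made up for illustration and are not from the question):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Slurp a whole file into one scalar by locally undefining $/.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Could not open $path: $!";
    local $/;                 # with $/ undefined, <$fh> reads to end of file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Return every number that follows "BGA" in a string held in memory.
sub bga_numbers {
    my ($text) = @_;
    return $text =~ /\bBGA\s+(\d+)/g;   # in list context: all captures
}

# Hypothetical sample standing in for slurp('alldata.txt')
my $sample = "header\nBGA 1\n1.0 2.0\nBGA 1\nheader\nBGA 17\n";
print join(',', bga_numbers($sample)), "\n";   # prints 1,1,17
```

Whether slurping is worthwhile depends on what you do next; for a line-by-line transformation, reading one line at a time is just as easy.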

Let's look at the few critical places in your code:

while(my $line = <$info>) {
    if ($line eq "/BGA/") {

Your condition $line eq "/BGA/" tests whether the line consists literally of the string "/BGA/". That can never be true, because every line will at least have the input record separator (i.e. the contents of $/) at the end, since you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere, and the proper Perl syntax to do that is

    if ($line =~ /BGA/) {

Now, once you fix that, you are going to run into a problem with the following statement:

print <>,"\n";

What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
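Putting those two fixes together, a sketch of the loop might look like this. The sample text and the sub name lines_with_bga are assumptions for illustration; with the real data you would open 'alldata.txt' instead of the in-memory string. Note also that close takes the filehandle, not the file name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the lines that contain "BGA" from an open filehandle.
sub lines_with_bga {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        push @hits, $line if $line =~ /BGA/;   # pattern match, not string equality
    }
    return @hits;
}

# Hypothetical sample input, read through an in-memory filehandle
my $sample = "some header\nBGA 1\n0.1 0.2\n";
open my $info, '<', \$sample or die "Could not open sample: $!";
print lines_with_bga($info);   # print the matched lines, not <>
close $info;                   # close the handle, not $file
```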

Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.

Upvotes: 0

Masterfool

Reputation: 119

If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:

while (my $line = <$info>) {
  while ( $line =~ /BGA\s(\d+)/g ) {
    print "$1\n";
  }
}

This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
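For the "append it as another column" part of the question, here is a rough sketch only. It ASSUMES that data rows consist purely of whitespace-separated numbers and that everything else is header text; the sub name append_bga_column and the sample input are hypothetical. Because it just remembers the most recent BGA number, the header appearing twice per section does no harm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy data rows from $in to $out, appending the most recent BGA number.
# ASSUMPTION: a data row contains only numbers separated by whitespace.
sub append_bga_column {
    my ($in, $out) = @_;
    my $bga;                                  # last BGA number seen so far
    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /\bBGA\s+(\d+)/) {
            $bga = $1;                        # a repeated header sets the same value again
            next;
        }
        next unless defined $bga;             # ignore anything before the first header
        next unless $line =~ /^[\s\d.eE+-]+$/ && $line =~ /\d/;   # numeric rows only
        print {$out} "$line\t$bga\n";
    }
    return;
}

# Hypothetical sample: two sections, each with its header repeated twice
my $sample = "File A\nBGA 1\nBGA 1\n0.5 1.5\n0.6 1.6\nFile B\nBGA 2\nBGA 2\n0.7 1.7\n";
open my $in, '<', \$sample or die "Could not open sample: $!";
append_bga_column($in, \*STDOUT);
close $in;
```

To write to a different file, open an output handle with open my $out, '>', 'some_output_file' and pass that instead of \*STDOUT.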

Upvotes: 0

pajaja

Reputation: 2202

if ($line =~ /BGA\s(\d+)/){
  #your code
  print "BGA number $1 \n";
  #your code
}

After a successful match, the $1 variable will hold the number you want.
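Keep in mind that $1 is only meaningful when the match succeeded. One way to make that explicit is to capture through a list assignment, sketched here with a hypothetical helper bga_number and made-up input lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the number after "BGA", or undef when the line has no such header.
sub bga_number {
    my ($line) = @_;
    my ($n) = $line =~ /\bBGA\s+(\d+)/;   # list assignment receives the capture
    return $n;
}

for my $line ("BGA 1", "BGA 20000 trailing", "no header here") {
    my $n = bga_number($line);
    print defined $n ? "found $n\n" : "no match\n";
}
```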

Upvotes: 2
