zock

Reputation: 223

Perl: Match exact word from tab-delimited file

I have a tab-delimited file (containing 2 columns) in the following format:

ABA-1 (tab)           CDF@
ABA-1 (tab)           EFG
ZYA (tab)             ABA-1 this
EFG that this (tab)   ZYA

I want to match only /EFG/ and not /EFG that this/. Similarly, I want to only match /ABA-1/ and not /ABA-1 this/.

The following pattern doesn't work:

$line=~ /^(\w*\-?\w*\@?)\t*(\w*\-?\w*\@?)$/

I have tried using word boundaries (\b), but that doesn't work either.

Any ideas on how to tackle this issue? Any help will be highly appreciated. Thanks a lot!

Upvotes: 2

Views: 3391

Answers (3)

Rohit Jain

Reputation: 213223

$line=~ /^(\w+)[^\t]*\t(\w+).*$/

This will capture only the first word before and after the tab.
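One caveat worth checking: \w matches only letters, digits, and underscore, so this pattern stops at the hyphen in ABA-1 and drops the @ in CDF@. A minimal sketch that shows this, where first_words is a hypothetical helper name used only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): applies the \w+ pattern
# and returns both captures, or an empty list when the line does not match.
sub first_words {
    my ($line) = @_;
    return $line =~ /^(\w+)[^\t]*\t(\w+).*$/ ? ($1, $2) : ();
}

print join(' ', first_words("ABA-1\tCDF@")), "\n";         # prints "ABA CDF"
print join(' ', first_words("EFG that this\tZYA")), "\n";  # prints "EFG ZYA"
```

If the hyphen and the @ need to be kept, a character class wider than \w is required.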

UPDATE: If you want to match the run of non-space characters before the first space in each column, you can try this pattern:

my $line = "ABA-1\tCDF@";
my $line1 = "ZYA \t  ABA-1 this";

if ($line =~ /^([^\s]+)[^\t]*\t\s*([^\s]+).*$/) {
    print "$1 $2\n";
}

# Same pattern applied to the second sample line.
if ($line1 =~ /^([^\s]+)[^\t]*\t\s*([^\s]+).*$/) {
    print "$1 $2\n";
}

OUTPUT:

ABA-1 CDF@
ZYA ABA-1

Upvotes: 1

Borodin

Reputation: 126722

Your regex doesn't work for a couple of reasons. Firstly, your tab can't be optional, otherwise the line won't be split properly. Secondly, there is nothing in your pattern to account for the possible characters after the parts that you want to match, i.e. nothing that matches " that this".

The first problem is fixed just by changing \t* to \t. You can solve the second by adding .*? after each capture (or, for the second capture, just removing the trailing $ anchor).

This modification works with your sample data:

$line =~ /^(\w*\-?\w*\@?).*?\t(\w*\-?\w*\@?).*?$/

but it isn't very pretty!

It looks like you just want each string of non-space characters that directly follows a tab or the beginning of the line.

This program encodes that idea as a regex:

use strict;
use warnings;

my @data = (
  "ABA-1\tCDF@",
  "ABA-1\tEFG", 
  "ZYA\tABA-1 this",
  "EFG that this\tZYA",
);

for (@data) {
  my @fields = /(?:^|\t)(\S+)/g;
  print "@fields\n";
}

Output:

ABA-1 CDF@
ABA-1 EFG
ZYA ABA-1
EFG ZYA
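The same idea can also be sketched without a single combined regex, by splitting on the tab first and then taking the first non-space token of each column. This is only an alternative sketch, not the method used above; first_tokens is a hypothetical helper name. Unlike the regex above, it also tolerates spaces between the tab and the value.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first run of non-space characters
# from each tab-separated column. Columns with no such run are skipped.
sub first_tokens {
    my ($line) = @_;
    chomp $line;
    return map { /(\S+)/ ? $1 : () } split /\t/, $line;
}

print join(' ', first_tokens("EFG that this\tZYA")), "\n";   # prints "EFG ZYA"
print join(' ', first_tokens("ZYA \t  ABA-1 this")), "\n";   # prints "ZYA ABA-1"
```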

Upvotes: 3

dan1111

Reputation: 6566

This will match two words (containing no spaces) separated by a single tab on a line:

$line=~ /^(\w+)\t(\w+)$/

Update: that will exclude any lines that have something like "ABA this". However, maybe you want to capture just the ABA out of "ABA this". This would do that for you:

$line=~ /^([A-Z]+)[^\t]*\t([A-Z]+)/

Update: here is a new pattern for the new requirements. It matches the first non-white-space portion in each column.

$line=~ /^([^\s]+).*\t\s*([^\s]+)/
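One thing to be aware of: the .* before \t is greedy, so it runs to the last tab on the line. With the two-column file in the question that makes no difference, but if a line ever had a third column (an assumption, not something the question describes), the second capture would come from the last column. A rough comparison, using two hypothetical helper names:

```perl
use strict;
use warnings;

# Hypothetical helpers for comparison only.
# greedy_fields uses the pattern from the answer above.
sub greedy_fields {
    my ($line) = @_;
    return $line =~ /^([^\s]+).*\t\s*([^\s]+)/ ? ($1, $2) : ();
}

# bounded_fields replaces .* with [^\t]* so the match stops at the first tab.
sub bounded_fields {
    my ($line) = @_;
    return $line =~ /^([^\s]+)[^\t]*\t\s*([^\s]+)/ ? ($1, $2) : ();
}

print join(' ', greedy_fields("ZYA\tABA-1 this")), "\n";   # prints "ZYA ABA-1"
print join(' ', greedy_fields("A\tB\tC")), "\n";           # prints "A C"
print join(' ', bounded_fields("A\tB\tC")), "\n";          # prints "A B"
```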

Upvotes: 1
