Reputation: 223
I have a tab-delimited file (containing 2 columns) in the following format:
ABA-1 (tab) CDF@
ABA-1 (tab) EFG
ZYA (tab) ABA-1 this
EFG that this (tab) ZYA
I want to match only /EFG/ and not /EFG that this/. Similarly, I want to only match /ABA-1/ and not /ABA-1 this/.
The following pattern doesn't work:
$line=~ /^(\w*\-?\w*\@?)\t*(\w*\-?\w*\@?)$/
I have tried using word boundaries (\b) but it doesn't work either.
Any ideas on how to tackle this issue? Any help will be highly appreciated. Thanks a lot!
Upvotes: 2
Views: 3391
Reputation: 213223
$line=~ /^(\w+)[^\t]*\t(\w+).*$/
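This will capture only the first run of word characters before and after the tab. Note that \w does not match - or @, so ABA-1 is captured as ABA and CDF@ as CDF.
UPDATE: If you want to capture everything up to the first whitespace character in each column, you can try this pattern: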
my $line = "ABA-1\tCDF@";
my $line1 = "ZYA \t ABA-1 this";
if ($line =~ /^([^\s]+)[^\t]*\t\s*([^\s]+).*$/) {
    print "$1 $2\n";
}
# The second test must run against $line1, not $line
if ($line1 =~ /^([^\s]+)[^\t]*\t\s*([^\s]+).*$/) {
    print "$1 $2\n";
}
OUTPUT:
ABA-1 CDF@
ZYA ABA-1
Upvotes: 1
Reputation: 126722
Your regex doesn't work for a couple of reasons. Firstly, your tab can't be optional, otherwise the line won't be split properly. Secondly, there is nothing in your pattern to account for the possible characters after the parts that you want to match, i.e. nothing that matches that this.
You can solve the first just by changing \t* to \t. The second is fixed by adding .*? after each capture (or, for the second capture, just removing the trailing $ anchor).
This modification works with your sample data:
$line =~ /^(\w*\-?\w*\@?).*?\t(\w*\-?\w*\@?).*?$/
but it isn't very pretty!
It looks like you just want all strings of non-space characters directly after a tab or the beginning of the line.
This program encodes that idea as a regex:
use strict;
use warnings;
my @data = (
"ABA-1\tCDF@",
"ABA-1\tEFG",
"ZYA\tABA-1 this",
"EFG that this\tZYA",
);
for (@data) {
my @fields = /(?:^|\t)(\S+)/g;
print "@fields\n";
}
Output:
ABA-1 CDF@
ABA-1 EFG
ZYA ABA-1
EFG ZYA
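As a rough alternative sketch, the same idea can be written without a single global regex: split each line on the tab, then keep the first run of non-space characters in each column. The helper name first_tokens is made up for illustration and is not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the first
# whitespace-delimited token of every tab-separated column in a line.
sub first_tokens {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty columns instead of dropping them
    my @columns = split /\t/, $line, -1;
    # A column with no non-space characters contributes an empty string
    return map { /(\S+)/ ? $1 : '' } @columns;
}

for my $line ("ABA-1\tCDF@", "ABA-1\tEFG", "ZYA\tABA-1 this", "EFG that this\tZYA") {
    print join(' ', first_tokens($line)), "\n";
}
```

One difference worth knowing: this version also tolerates spaces directly after the tab (for example "ZYA \t ABA-1 this"), whereas /(?:^|\t)(\S+)/g requires the token to start immediately after the tab or the beginning of the line.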
Upvotes: 3
Reputation: 6566
This will match two words (made up only of word characters, so no spaces, hyphens or @ signs) separated by a single tab on a line:
$line=~ /^(\w+)\t(\w+)$/
Update: that will exclude any lines that have something like "ABA this". However, maybe you want to capture just the ABA out of "ABA this". This would do that for you:
$line=~ /^([A-Z]+)[^\t]*\t([A-Z]+)/
Update: here is a new pattern for the new requirements. It matches the first non-white-space portion in each column.
$line=~ /^([^\s]+).*\t\s*([^\s]+)/
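One caveat with the last pattern, sketched below under the assumption that a line might someday contain more than two columns (the question only mentions two): the greedy .* runs to the last tab on the line, while [^\t]* stops at the first one. The sub names greedy_pair and first_tab_pair are illustrative only.

```perl
use strict;
use warnings;

# Pattern from the answer above: greedy .* backtracks to the LAST tab.
sub greedy_pair {
    my ($line) = @_;
    return $line =~ /^([^\s]+).*\t\s*([^\s]+)/ ? ($1, $2) : ();
}

# Variant using [^\t]* so the match is pinned to the FIRST tab.
sub first_tab_pair {
    my ($line) = @_;
    return $line =~ /^([^\s]+)[^\t]*\t\s*([^\s]+)/ ? ($1, $2) : ();
}

for my $line ("EFG that this\tZYA", "a\tb\tc") {
    print join(' ', greedy_pair($line)), " | ",
          join(' ', first_tab_pair($line)), "\n";
}
```

For two-column input both versions return the same pair; they differ only on the three-column line, where the greedy version pairs the first column with the last one.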
Upvotes: 1