Reputation: 1879
I have the following file(like this scheme, but much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
There are 2 words (like i.e. LSE ZTX
) in every line with optional spaces and/or tabs at the beginning, at the end and always in between.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how can I say, that there could be either spaces or tabs (or both of them mixed, so for ex. \t\s\t)
Upvotes: 0
Views: 133
Reputation: 6204
You can use split
here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
Upvotes: 0
Reputation: 2859
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$
What this does
^ // Matches the beginning of a string
\s* // Matches a space/tab character zero or more times
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $1
\s+ // Then matches at least one tab or space
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $2
$ // Matches the end of a string
Upvotes: 1
Reputation: 67910
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line
into the variables, if they exist. Note that parentheses are not required when using the /g
modifier. Use an array instead if you want to capture all existing matches.
Upvotes: 3
Reputation: 1051
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named: - word1 - word2
Upvotes: 1
Reputation: 91508
\s
includes also tabulation so your regex looks like:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
the first word is in the first group ($1) and the second in $2.
You can change [A-Z]
to whatever's more convenient with your needs.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
Upvotes: 1
Reputation: 93026
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
See it here on Regexr, you find the first code of a row in group 1 and the second in group 2. \s
is a whitespace character, it includes e.g. spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s
and m
, and g
is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z]
with \p{L}
. \p{L}
is a Unicode property that will match every letter in every language.
Upvotes: 1
Reputation: 17131
Always two words, you don't need to match the entire line, so your most simple regex would be:
/(\w+)\s+(\w+)/
Upvotes: 3
Reputation: 72947
Assuming the spaces at the start of the line are what you use to identify your codes you want, try this:
Split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some space, followed by a (word - some space - word), then end with some space.
# ^ --> String start
# \s+ --> Any number of spaces
# (\w+\s+){2} --> A (word followed by some space)x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
Upvotes: -1