MaMu

Reputation: 1879

I can't find proper regexp

I have the following file (it follows this scheme, but is much longer):

LSE           ZTX                       
    SWX         ZURN                    
LSE           ZYT
NYSE                            CGI  

Every line contains 2 words (e.g. LSE ZTX), with optional spaces and/or tabs at the beginning and at the end, and always some in between. Could someone help me match these 2 words with a regexp? Following the example, I would like to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second, etc. I have tried something like:

$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;

I don't know how to say that there could be either spaces or tabs (or both of them mixed, e.g. \t\s\t).

Upvotes: 0

Views: 133

Answers (9)

Kenosis

Reputation: 6204

You can use split here:

use strict;
use warnings;

while (<DATA>) {
    my ( $word1, $word2 ) = split;
    print "($word1, $word2)\n";
}

__DATA__
LSE         ZTX                       
    SWX         ZURN                    
LSE         ZYT
NYSE                            CGI

Output:

(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
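
A bare split works here because, with no arguments, it splits $_ on runs of whitespace and ignores leading whitespace, so indentation does not shift the words. Below is a minimal sketch of the same idea wrapped in a sub that also skips lines with fewer than two words. The name first_two_words is hypothetical and not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first two whitespace-separated words
# of a line, or an empty list if the line has fewer than two words.
sub first_two_words {
    my ($line) = @_;
    # split ' ' is the special "awk mode": it splits on runs of
    # whitespace and ignores leading whitespace.
    my @words = split ' ', $line;
    return @words >= 2 ? @words[0, 1] : ();
}

for my $line ("LSE \t ZTX  \n", "    SWX\tZURN\n", "\n") {
    # The list assignment yields 0 in scalar context for an empty list,
    # so blank lines are skipped.
    my ($word1, $word2) = first_two_words($line) or next;
    print "($word1, $word2)\n";
}
```

Call the helper in list context, as the loop does, so that both words are returned.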

Upvotes: 0

Zack

Reputation: 2859

^\s*([A-Z]{3,4})\s+([A-Z]{3,4})\s*$

What this does

^             // Matches the beginning of the string
\s*           // Matches zero or more whitespace characters (spaces, tabs)
([A-Z]{3,4})  // Matches 3 or 4 letters A-Z and captures them to $1
\s+           // Then matches at least one whitespace character
([A-Z]{3,4})  // Matches 3 or 4 letters A-Z and captures them to $2
\s*           // Allows optional trailing whitespace
$             // Matches the end of the string

Upvotes: 1

TLP

Reputation: 67910

If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:

my ($word1, $word2) = $line =~ /\S+/g;

This will capture the first two words in $line into the variables, if they exist. Note that capturing parentheses are not required when using the /g modifier in list context: the match then returns every substring matched by the whole pattern. Use an array instead if you want to capture all existing matches.
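
As a rough sketch of the array variant, the following collects every non-whitespace run and then takes the first two. The helper name all_words is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: all non-whitespace runs in a line, in order.
sub all_words {
    my ($line) = @_;
    # In list context, m//g without capture groups returns every
    # substring matched by the whole pattern.
    my @words = $line =~ /\S+/g;
    return @words;
}

my ($word1, $word2) = all_words("NYSE \t\t   CGI  ");
print "$word1 $word2\n";
```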

Upvotes: 3

Scordo

Reputation: 1051

With the "Multiline" option (/m in Perl), this regex:

^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$

will give you N matches, each containing two named groups: word1 and word2.
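
In Perl (5.10 or later), named groups are read from the %+ hash. The sketch below applies the regex to a whole multi-line string, assuming every line holds exactly two words; since \s also matches newlines, a line with a single word could pair up with the next line. The helper name word_pairs is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [word1, word2] pairs found in a
# multi-line string. Named captures need Perl 5.10 or later.
sub word_pairs {
    my ($text) = @_;
    my @pairs;
    # /m lets ^ and $ match at line boundaries; /g walks through all lines.
    while ($text =~ /^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$/mg) {
        push @pairs, [ $+{word1}, $+{word2} ];
    }
    return @pairs;
}

my $text = "LSE \t ZTX  \n    SWX\tZURN\nNYSE      CGI\n";
print "$_->[0] $_->[1]\n" for word_pairs($text);
```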

Upvotes: 1

Toto

Reputation: 91508

\s also matches tabs, so your regex could look like:

$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;

The first word is in the first group ($1) and the second is in $2.

You can change [A-Z] to whatever's more convenient with your needs.

Here is the explanation from YAPE::Regex::Explain:

The regular expression:

(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Upvotes: 1

stema

Reputation: 93026

I think this is what you want

^\s*([A-Z]+)\s+([A-Z]+)

See it here on Regexr; you will find the first code of a row in group 1 and the second in group 2. \s matches a whitespace character, which includes spaces, tabs and newline characters, among others.

In Perl it is something like this:

($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;

I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.

In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
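
A small sketch of the \p{L} variant, assuming the input has already been decoded into Perl character strings. The helper name two_codes and the sample letters are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: captures two codes made of any Unicode letters.
# Returns ($code1, $code2) on success, or an empty list on failure.
sub two_codes {
    my ($line) = @_;
    return $line =~ /^\s*(\p{L}+)\s+(\p{L}+)/;
}

# "\x{...}" escapes keep this source file pure ASCII.
my $line = "  \x{0416}\x{0418}\t\x{00C4}BC  ";
my ($code1, $code2) = two_codes($line);

# Encode the output so that the non-ASCII letters print cleanly.
binmode STDOUT, ':encoding(UTF-8)';
print "$code1 $code2\n";
```

When reading from a real file, the handle would need a matching decoding layer (for example :encoding(UTF-8)) for \p{L} to see letters rather than raw bytes.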

Upvotes: 1

OmnipotentEntity

Reputation: 17131

Since there are always two words, you don't need to match the entire line, so the simplest regex would be:

/(\w+)\s+(\w+)/

Upvotes: 3

Cerbrus

Reputation: 72947

Assuming the spaces at the start of the line are what you use to identify the codes you want, try this:

Split your string up at newlines, then try this regex:

^\s+(\w+\s+){2}$

This will only match lines that start with whitespace, followed by two words that are each followed by whitespace (so the line must also end with whitespace).

# ^           --> String start
# \s+         --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by whitespace), twice
# $           --> String end

However, if you want to capture the codes alone, try this:

$line =~ /^\s*(\w+)\s+(\w+)/;

# \s*   --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+   --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2).

Upvotes: -1

zzzzzzzzz

Reputation: 57

This will match all your codes when used with the /g modifier:

/[A-Z]+/g

Upvotes: -2
