Reputation: 353

How to avoid groups of characters that might or might not appear/interfere in a regex?

I'm experiencing a bit of a rippled carpet case with a regular expression.

The string in its raw form that is being processed looks something like this:

1. [8] S Wawrinka (SUI) vs. A Golubev (KAZ)  2. D Sela (ISR) vs. J Nieminen (FIN)  3. S Giraldo (COL) vs. S Querrey (USA)  4. A Falla (COL) vs. M Kukushkin (KAZ)  5. I Karlovic (CRO) vs. [32] I Dodig (CRO)  6. [WC] S Johnson (USA) vs. A Mannarino (FRA)  7. [14] M Youzhny (RUS) vs. JL Struff (GER)  8. A Gonzalez (COL) vs. [3] D Ferrer (ESP)  9. [7] T Berdych (CZE) vs. A Nedovyesov (KAZ)  10. N Mahut (FRA) vs. M Ebden (AUS)H2H RR2*  11. [Q] D Thiem (AUT) vs. J Sousa (POR)  12. J Monaco (ARG) vs. [23] E Gulbis (LAT)  13. J Hajek (CZE) vs. [Q] D Dzumhur (BIH)

I'm not trying to make it hard to read; this is the exact output extracted from the HTML. What I'm trying to match is this (examples taken from the output above):

 S Wawrinka (SUI) vs. A Golubev (KAZ)

 I Karlovic (CRO) vs. I Dodig (CRO)

 J Hajek (CZE) vs. D Dzumhur (BIH)

Notice that in the last two I've had to do some clean-up of a few bracketed char groups.

So basically I want to extract every record in this long string. A record is identified by its vs., so if there are 12 vs.'s in the string there should be 12 records. (Both sides of the vs. are matched in the expected output, so that part is not a worry of mine.)

The three examples above show what I'm trying to ignore. The unwanted characters are enclosed in a pair of square brackets, can appear on either side of the vs., and are one of: [WC], [Q], [LL], [12], [1], [28]

Things that never change:

vs. is guaranteed to be there for each record
the junk characters are always in brackets and appear before either name
the overall record always keeps its format

Something that might make the matching tricky is that the initial might be the same as one of the junk characters ( Q, W).

I've tried several expressions. Pretty much all of them only achieve partial matching, which is as good as none. Perhaps the most successful was:

       qr /
        ([A-Z]{1,2}   # Initials
        \s?
        [A-Za-z\']+   # Last name
        -?            # in case of hyphenated name
        \s?
        [A-Za-z\.]?   # two namer
        \s?
        \([A-Z]{3}\)  # country code
        \s?
        vs[.]?        # vs.
        \s?
        [^\]]\]?      # optional unwanted characters
        \s?
        [A-Z]{1,2}    
        \s?
        [A-Za-z\']+
        -?
        \s?
        [A-Za-z\.]*
        \s?
        \([A-Z]{3}\))
        /sx

I could pretty much match everything and then just clean up what I don't want, but I want a clean one-pass solution.

Upvotes: 1

Answers (3)

GWP

Reputation: 131

Quit trying to cram it into one statement. Since you don't care about the bracketed info, clobber it first with a substitution, s/\[.*?\]//g;, and then split on /\d+\./
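A minimal sketch of that two-step approach, run on a shortened copy of the sample string. The variable names ($raw, $clean, @records) are made up for illustration, and the extra substitution that drops residue such as "H2H RR2*" after the last country code is an assumption about what should count as junk, not part of the suggestion itself.

```perl
use strict;
use warnings;

# Shortened copy of the sample string from the question ($raw is a made-up name).
my $raw = '1. [8] S Wawrinka (SUI) vs. A Golubev (KAZ)  5. I Karlovic (CRO) vs. [32] I Dodig (CRO)  10. N Mahut (FRA) vs. M Ebden (AUS)H2H RR2*  13. J Hajek (CZE) vs. [Q] D Dzumhur (BIH)';

(my $clean = $raw) =~ s/\[.*?\]//g;    # clobber every bracketed tag

my @records;
for my $chunk (split /\d+\./, $clean) {
    next unless $chunk =~ /vs\./;      # skips the empty field before "1."
    $chunk =~ s/\)[^)]*$/)/;           # assumption: drop anything after the last ")"
    $chunk =~ s/\s+/ /g;               # collapse the gaps left by the removed tags
    $chunk =~ s/^\s+|\s+$//g;          # trim both ends
    push @records, $chunk;
}

print "$_\n" for @records;
```

Splitting on /\d+\./ is safe here only because the bracketed numbers are already gone and no digit inside a record is directly followed by a period.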

Upvotes: 0

perreal

Reputation: 97948

One way is getting rid of the distracting stuff:

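# $t = "1. [8] S Waw....
my $re_name = qr/\b\w{1,2} \w+ [(]\w+[)]/;  # one or two initials (e.g. "JL"), last name, country code
$t =~ s/\[[^\]]*\]//g;  # remove squared stuff
$t =~ s/ +/ /g;         # collapse the runs of spaces left behind
print "$1 vs. $2\n" while $t =~ /($re_name) vs[.] ($re_name)/g;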

Upvotes: 1

Marius Schulz

Reputation: 16440

Let me suggest the following algorithm:

Split your string by the pattern [0-9]+\. to get all the records. (You'll have to discard the first, empty item.)
Split each item by the string "vs." to get both contestants.
Parse the name, nation, etc. of each contestant using a much simpler regex.
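The steps above could be sketched in Perl roughly as follows. This is only an illustration under assumptions: $raw holds a shortened copy of the sample string, parse_contestant is a hypothetical helper name, and the per-contestant regex is one possible "much simpler regex", not the only one.

```perl
use strict;
use warnings;

# Shortened copy of the sample string ($raw and parse_contestant are made-up names).
my $raw = '1. [8] S Wawrinka (SUI) vs. A Golubev (KAZ)  7. [14] M Youzhny (RUS) vs. JL Struff (GER)  13. J Hajek (CZE) vs. [Q] D Dzumhur (BIH)';

# Step 3: a small per-contestant regex. The optional leading [..] tag is matched
# outside the captures, so a tag like [Q] is never mistaken for an initial.
sub parse_contestant {
    my ($text) = @_;
    return unless $text =~ /^\s*(?:\[[^\]]*\]\s*)?([A-Z]{1,2})\s+([A-Za-z' -]+?)\s*\(([A-Z]{3})\)/;
    return { initials => $1, name => $2, nation => $3 };
}

my @matches;
# Step 1: split on the record numbers; grep drops the empty first item.
for my $record (grep { /\S/ } split /[0-9]+\./, $raw) {
    # Step 2: split on "vs." to get the two contestants.
    my ($left, $right) = map { scalar parse_contestant($_) } split /vs\./, $record, 2;
    next unless $left && $right;
    push @matches, join ' vs. ',
        map { "$_->{initials} $_->{name} ($_->{nation})" } $left, $right;
}

print "$_\n" for @matches;
```

Because each contestant is parsed on its own, the bracketed tag only ever has to be handled in one place, at the start of the string.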

Upvotes: 1
