Reputation: 353
I'm experiencing a bit of a rippled carpet case with a regular expression.
The raw string being processed looks something like this:
1. [8] S Wawrinka (SUI) vs. A Golubev (KAZ) 2. D Sela (ISR) vs. J Nieminen (FIN) 3. S Giraldo (COL) vs. S Querrey (USA) 4. A Falla (COL) vs. M Kukushkin (KAZ) 5. I Karlovic (CRO) vs. [32] I Dodig (CRO) 6. [WC] S Johnson (USA) vs. A Mannarino (FRA) 7. [14] M Youzhny (RUS) vs. JL Struff (GER) 8. A Gonzalez (COL) vs. [3] D Ferrer (ESP) 9. [7] T Berdych (CZE) vs. A Nedovyesov (KAZ) 10. N Mahut (FRA) vs. M Ebden (AUS)H2H RR2* 11. [Q] D Thiem (AUT) vs. J Sousa (POR) 12. J Monaco (ARG) vs. [23] E Gulbis (LAT) 13. J Hajek (CZE) vs. [Q] D Dzumhur (BIH)
I'm not trying to make it hard to read; this is the exact output extracted from the HTML. What I'm trying to match is this (examples taken from the output above):
S Wawrinka (SUI) vs. A Golubev (KAZ)
or
I Karlovic (CRO) vs. I Dodig (CRO)
or
J Hajek (CZE) vs. D Dzumhur (BIH)
Notice that in the last two I've had to do some clean-up of a few bracketed char groups.
So basically I want to extract all of the records in this long string. A record is identified by a vs., so if there are 12 vs.'s in the string there should be 12 records. (In the expected output each record is matched on both sides of the vs., so that part is not a worry of mine.)
The 3 examples above also show what I'm trying to ignore. The unwanted characters appear inside a pair of square brackets on either side of the vs. and are one of: [WC], [Q], [LL], [12], [1], [28]
Things that never change:
Something that might make the matching tricky is that a player's initial might be the same as one of the junk characters (Q, W).
I've tried several expressions, and pretty much all of them only achieve partial matching, which is as good as none. Perhaps the most successful was:
qr /
([A-Z]{1,2} # Initials
\s?
[A-Za-z\']+ # Last name
-? # in case of hyphenated name
\s?
[A-Za-z\.]? # two namer
\s?
\([A-Z]{3}\) # country code
\s?
vs[.]? # vs.
\s?
[^\]]\]? # optional unwanted characters
\s?
[A-Z]{1,2}
\s?
[A-Za-z\']+
-?
\s?
[A-Za-z\.]*
\s?
\([A-Z]{3}\))
/sx
I could pretty much match everything and then just clean up what I don't want, but I'd like a clean solution that does it in one go.
Upvotes: 1
Views: 98
Reputation: 131
Quit trying to cram it into one statement.
Since you don't care about the bracketed info, just clobber that with a substitute operation
s/\[.*?\]//g;
first, and then split on
/\d+\./
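As a rough sketch of how those two steps might fit together (the sample string is a shortened copy of the one in the question, and the trailing \s* in the substitution plus the trimming are small additions of mine to avoid stray spaces, not part of the suggestion above):

```perl
use strict;
use warnings;

# Shortened sample of the input from the question (assumed to have the same layout).
my $text = '1. [8] S Wawrinka (SUI) vs. A Golubev (KAZ) '
         . '5. I Karlovic (CRO) vs. [32] I Dodig (CRO) '
         . '13. J Hajek (CZE) vs. [Q] D Dzumhur (BIH)';

# Step 1: remove every bracketed group such as [8], [32] or [Q],
# along with any whitespace that follows it.
$text =~ s/\[.*?\]\s*//g;

# Step 2: split on the record numbers ("1.", "5.", ...).
# The first field is empty because the string starts with a number.
my @records;
for my $field (split /\d+\./, $text) {
    (my $record = $field) =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
    push @records, $record if length $record;  # skip the empty first field
}

print "$_\n" for @records;
```

Note that any other debris between records (like the H2H RR2* fragment in the full string) would survive this and would still need its own clean-up.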
Upvotes: 0
Reputation: 97948
One way is getting rid of the distracting stuff:
# $t = "1. [8] S Waw....
my $re_name = qr/\b\w{1,2} \w+ [(]\w+[)]/; # initials may be one or two letters, e.g. "JL Struff"
$t =~ s/\[[^\]]*\]//g; $t =~ s/ +/ /g; # remove squared stuff
print "$1 vs. $2\n" while $t =~ /($re_name) vs[.] ($re_name)/g;
Upvotes: 1
Reputation: 16440
Let me suggest the following algorithm:
1. Split the string on
[0-9]+\.
This will give you all the records. (You'll have to discard the first empty item.)
2. Split each record on
vs.
This will give you both contestants.
Upvotes: 1
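The two-step split suggested in this last answer could be sketched roughly as follows. The sample string is a shortened copy of the one in the question, and stripping the bracketed tags from each contestant is an extra step assumed here, since the split alone leaves them in place:

```perl
use strict;
use warnings;

# Shortened sample of the input from the question (assumed to have the same layout).
my $text = '1. [8] S Wawrinka (SUI) vs. A Golubev (KAZ) '
         . '2. D Sela (ISR) vs. J Nieminen (FIN) '
         . '5. I Karlovic (CRO) vs. [32] I Dodig (CRO)';

my @matches;
for my $record (split /[0-9]+\./, $text) {
    next unless $record =~ /\S/;                # discard the empty first item

    # Split the record into its two contestants.
    my @players = split /\s+vs\.\s+/, $record;
    next unless @players == 2;                  # skip anything malformed

    for my $player (@players) {
        $player =~ s/\[[^\]]*\]//g;             # assumed extra step: drop [8], [Q], ...
        $player =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
    }
    push @matches, \@players;
}

printf "%s vs. %s\n", @$_ for @matches;
```

Keeping the contestants as pairs rather than rejoined strings makes it easy to use either side of the match separately later on.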