Reputation: 95
Using Perl, I have slurped in a large file that contains the text below, and I am trying to capture every $1
match in the file for my regex. My regex is
=~ /((GET|PUT|POST|CONNECT).*?(Content-Type: (image\/jpeg)))/sgm
Currently the text in bold is being captured. However, the last capture includes the lines from
"GET /~sgtatham/putty/latest/x86/pscp.exe HTTP/1.1" to "Content-Type: text/html; charset=iso-8859-1",
and it should not, because "text/html" does not match the (image\/jpeg) part of my regex.
I want the last capture to start at the second GET, so that the lines from
"GET /~sgtatham/putty/latest/x86/pscp.exe HTTP/1.1" to "Content-Type: text/html; charset=iso-8859-1" are not included.
Appreciate any help, thank you.
**GET /~sgtatham/putty/latest/x86/pscp.exe HTTP/1.1
Host: the.earth.li
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
\.+"
GET /~sgtatham/putty/0.62/x86/pscp.exe HTTP/1.1
Host: the.earth.li
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Content-Length: 315392
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: image/jpeg**
Platform: Digital Engagement Platform; Version: 1.1.0.0
Upvotes: 3
Views: 744
Reputation: 3465
You can easily do this with (?!pattern), a negative look-ahead assertion.
For a recap, read the article "Positive examples of positive and negative lookahead" (ourcraft.wordpress.com).
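As a quick illustration of negative look-ahead on its own, here is a minimal sketch. The strings 'foobar' and 'foobaz' are made up for the example and are not from the question's data.

```perl
use strict;
use warnings;

# "foo" matches only where it is NOT immediately followed by "bar".
# The look-ahead checks the upcoming text without consuming it.
for my $s ('foobar', 'foobaz') {
    my $hit = $s =~ /foo(?!bar)/ ? 'match' : 'no match';
    print "$s: $hit\n";
}
```

The same idea is applied below, one character at a time, to stop the match from running across the start of another request.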
Regular expression
$text =~ /
( # start capture
(?:GET|PUT|POST|CONNECT) # start phrase
(?:
(?!GET|PUT|POST|CONNECT) # make sure none of these phrases starts here
. # accept any character
)*? # any number of times (not greedy)
Content-Type:\simage\/jpeg # end phrase
) # end capture
/msx;
print $1;
All occurrences
while ($text =~ m/REGEXP/msxg) { # REGEXP stands for the pattern shown above
print $1;
}
Output
GET /~sgtatham/putty/0.62/x86/pscp.exe HTTP/1.1
Host: the.earth.li
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Content-Length: 315392
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: image/jpeg
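
To see the difference between the two patterns side by side, here is a self-contained sketch. The shortened sample text and the variable names ($original, $tempered, @tempered_matches) are assumptions made for illustration. The patterns themselves are the one from the question and the one from this answer, precompiled with qr// so they can be reused.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample: two requests, only the second
# one ends in "Content-Type: image/jpeg".
my $text = join "\n",
    'GET /latest/pscp.exe HTTP/1.1',
    'Host: the.earth.li',
    'Content-Type: text/html; charset=iso-8859-1',
    '<html><head>',
    'GET /0.62/pscp.exe HTTP/1.1',
    'Host: the.earth.li',
    'Content-Type: image/jpeg',
    '';

# Pattern from the question: the lazy .*? starts at the first GET and
# keeps expanding, past the second GET, until it reaches image/jpeg.
my $original = qr/((GET|PUT|POST|CONNECT).*?(Content-Type: (image\/jpeg)))/s;

# Pattern from the answer: before each character is consumed, the
# look-ahead verifies that no new request verb starts at that position.
my $tempered = qr{
    (                                      # start capture
        (?:GET|PUT|POST|CONNECT)           # start phrase
        (?: (?!GET|PUT|POST|CONNECT) . )*? # any char, unless a verb starts here
        Content-Type:\simage/jpeg          # end phrase
    )                                      # end capture
}sx;

# In list context a match returns its capture groups; keep only $1.
my ($first_original) = $text =~ $original;

# With /g in list context, every $1 of every match is returned.
my @tempered_matches = $text =~ /$tempered/g;

print "Original pattern captured:\n$first_original\n\n";
print "Tempered pattern captured:\n$_\n" for @tempered_matches;
```

On this sample the original pattern's capture begins at the first GET and spans both requests, while the tempered pattern yields a single capture that begins at the second GET. Note that the look-ahead also rejects the verbs when they appear inside a header value or body, so real traffic may call for anchoring the verbs to the start of a line.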
Upvotes: 3