Reputation: 171
I want to extract data from HTML. The problem is that I can't extract two strings, one at the top and one at the bottom of my pattern.
I want to extract 23423423423
and 1234523453245
but only if the string Allan
appears between them:
<h4><a href="/Profile/23423423423.html">@@@@@@</a> </h4> said12:49:32
</div>
<a href="javascript:void(0)" onclick="replyAnswer(@@@@@@@@@@,'GET','');" class="reportLink">
report </a>
</div>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>
Of course, I can do something like this: Profile\/(\d+).*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*Allan.*\s*.*\s*.*AddAnswer(\d+)
. But that pattern is horrible. Is there a way to make it shorter?
I was thinking about:
Profile\/(\d+)(.\sAllan)*AddAnswer(\d+)
or
Profile\/(\d+)(.*Allan\s*)*AddAnswer(\d+)
but neither of them works properly. Do you have any ideas?
Upvotes: 1
Views: 70
Reputation: 43199
It is better to use an HTML parser instead. If you must use regular expressions for whatever reason, you might get by with a tempered greedy token (written here in free-spacing mode, so it needs the x flag):
Profile/(\d+) # Profile followed by digits
(?:(?!Allan)[\S\s])+ # any character except when there's Allan ahead
Allan # Allan literally
(?:(?!AddAnswer)[\S\s])+ # same construct as above
AddAnswer(\d+) # AddAnswer, followed by digits
See a demo on regex101.com
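For anyone who wants to try the pattern outside regex101, here is a minimal sketch in Python. The choice of Python, the re.VERBOSE flag (Python's spelling of free-spacing mode) and the helper name extract_ids are assumptions for illustration; the sample markup is the snippet from the question.

```python
import re

# Sample markup copied from the question (the @ runs are the asker's placeholders).
HTML = """<h4><a href="/Profile/23423423423.html">@@@@@@</a> </h4> said12:49:32
</div>
<a href="javascript:void(0)" onclick="replyAnswer(@@@@@@@@@@,'GET','');" class="reportLink">
report </a>
</div>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>
"""

# re.VERBOSE lets the pattern keep its whitespace and # comments.
PATTERN = re.compile(
    r"""
    Profile/(\d+)               # Profile followed by digits
    (?:(?!Allan)[\S\s])+        # any character, as long as Allan does not start here
    Allan                       # Allan literally
    (?:(?!AddAnswer)[\S\s])+    # same construct, stopping before AddAnswer
    AddAnswer(\d+)              # AddAnswer followed by digits
    """,
    re.VERBOSE,
)


def extract_ids(html):
    """Hypothetical helper: return (profile_id, answer_id) or None if no match."""
    match = PATTERN.search(html)
    return match.groups() if match else None


print(extract_ids(HTML))  # ('23423423423', '1234523453245')
```

If Allan is missing from the text, extract_ids returns None, which is the "only if Allan is in between" condition from the question.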
Upvotes: 0
Reputation: 4052
You can construct a character class that matches any character, including newlines, by using [\S\s]
. Whitespace and non-whitespace characters together cover every character.
With that, your attempts were reasonably close:
/Profile\/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)/
This looks for Profile/, captures the number after it, skips any characters up to Allan, then any characters up to AddAnswer, and captures the number after that. If your regex flavor has single-line (dotall) mode (/s
), you can use dots instead:
/Profile\/(\d+).*Allan.*AddAnswer(\d+)/s
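A quick way to see that the two spellings behave the same is to run them side by side. The sketch below uses Python, where single-line mode is re.DOTALL; the two-post PAGE string is made up for illustration. It also shows a caveat: both quantifiers are greedy, so on a page with several posts they stretch to the last Allan and the last AddAnswer. The lazy variant (.*?) is my addition, not part of the answer, and it can still run across a post that does not mention Allan.

```python
import re

# Hypothetical two-post page, reduced to the parts the pattern looks at.
PAGE = (
    '<a href="/Profile/111.html">x</a>\n'
    "<p>Hi there, Allan.</p>\n"
    '<div id="AddAnswer222"></div>\n'
    '<a href="/Profile/333.html">y</a>\n'
    "<p>Hello, Allan.</p>\n"
    '<div id="AddAnswer444"></div>\n'
)

# The two patterns from the answer: [\S\s] versus a dot in dotall mode.
CLASS_BASED = re.compile(r"Profile/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)")
DOT_BASED = re.compile(r"Profile/(\d+).*Allan.*AddAnswer(\d+)", re.DOTALL)

# Lazy quantifiers stop at the first Allan / AddAnswer they can reach.
LAZY = re.compile(r"Profile/(\d+).*?Allan.*?AddAnswer(\d+)", re.DOTALL)

greedy_class = CLASS_BASED.findall(PAGE)
greedy_dot = DOT_BASED.findall(PAGE)
lazy = LAZY.findall(PAGE)

print(greedy_class)  # [('111', '444')]
print(greedy_dot)    # [('111', '444')]
print(lazy)          # [('111', '222'), ('333', '444')]
```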
Upvotes: 2
Reputation: 2621
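In Ruby, you can use m
to make .
match newlines. In most other flavors (PCRE, Python, JavaScript) that is the job of the s flag, and m only changes where ^ and $ match. Note that the pattern below does not check for Allan; put .+Allan.+ between the two groups if you need that condition.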
/Profile\/(\d+).+AddAnswer(\d+)/m
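Which flag lets the dot cross line breaks depends on the regex flavor, so it is worth checking in the one you use. As a minimal sketch in Python (an assumption, since the question does not name a language), re.MULTILINE corresponds to m and re.DOTALL to s, and only the latter lets . match a newline:

```python
import re

# Two-line input: the dot has to cross a newline for the pattern to match.
TEXT = "Profile/1\nAddAnswer2"
PATTERN = r"Profile/(\d+).+AddAnswer(\d+)"

# re.MULTILINE (m) only changes ^ and $; the dot still stops at "\n".
multiline = re.search(PATTERN, TEXT, re.MULTILINE)

# re.DOTALL (s) is the flag that lets the dot match "\n" as well.
dotall = re.search(PATTERN, TEXT, re.DOTALL)

print(multiline)        # None
print(dotall.groups())  # ('1', '2')
```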
Upvotes: 0