Reputation: 4869
I get the following HTML response from a service:
<style> .transcription, .trsc{line-height:19px; padding-left:20px; font-family:Lucida Sans Unicode; padding-right:5px;} </style><div id="shView"> <div class="cforms_result" id="cforms_result1"> <div class="ref_cform" onclick="javascript:GetFullWordCBK('1', 'wordER');"><span class="fsform_link"><a href="javascript:;" onclick="javascript:GetFullWordCBK('1', 'wordER');"><img src="/images/common/owl_ico16.gif" width="19" height="19" border="0"></a><a href="javascript:;" onclick="javascript:GetFullWordCBK('1', 'wordER');"> Спряжение </a></span><span class="ref_source">mother<wrs><span class="sforms_src"><span class="w_des">Infinitive</span><b>mother</b><br><span class="w_des">Past Indefinite</span><b>mothered</b><br><span class="w_des">Participle II</span><b>mothered</b><br><span class="w_des">Participle I</span><b>mothering</b></span></wrs></span> <span class="ref_info"></span>, <span class="ref_psp">Глагол</span></div> <div class="tr_pr"><span class="transcription">[ˈmʌðə]</span><span class="pronunciation"><a href="javascript:;" class="pbf_s" id="lnkGtTr1" onclick="javascript:ListenWord(this,'mother',1,'play');"><img src="/images/common/vol_on.gif" align="absmiddle" border="0" id="imgGtTr1"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></a><span class="loadFrv" id="loadFrv1"><img hspace="10" src="/images/common/al_fullWR.gif" align="absmiddle"></span><span style="width:20px; height:17px;" class="pbf_s" id="speaker_on1"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></span></span></div> <div id="translations" onclick="javascript:GetFullWordCBK('1', 'wordER');"> <ol> <li><span class="ref_result">относиться по-матерински<wrs><span 
class="sforms_src"></span></wrs></span> <span class="ref_info"></span></li> </ol> </div> </div><script> $('.sforms_src').filter(function(index) { return $(this).html().length == 0;}).remove();//getPrLink('mother ');//$('#speaker_on').unbind('click','ShowFullWRefERRE')//$('#speaker_on').click(function(){alert("не открывать окно расширеной справки");}); </script><div class="cforms_result" id="cforms_result2"> <div class="ref_cform" onclick="javascript:GetFullWordCBK('2', 'wordER');"><span class="fsform_link"><a href="javascript:;" onclick="javascript:GetFullWordCBK('2', 'wordER');"><img src="/images/common/owl_ico16.gif" width="19" height="19" border="0"></a><a href="javascript:;" onclick="javascript:GetFullWordCBK('2', 'wordER');"> Склонение </a></span><span class="ref_source">mother<wrs><span class="sforms_src"><span class="w_des">Singular</span><b>mother</b><br><span class="w_des">Plural</span><b>mothers</b></span></wrs></span> <span class="ref_info"></span>, <span class="ref_psp">Существительное</span></div> <div class="tr_pr"><span class="transcription">[ˈmʌðə]</span><span class="pronunciation"><a href="javascript:;" class="pbf_s" id="lnkGtTr2" onclick="javascript:ListenWord(this,'mother',2,'play');"><img src="/images/common/vol_on.gif" align="absmiddle" border="0" id="imgGtTr2"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></a><span class="loadFrv" id="loadFrv2"><img hspace="10" src="/images/common/al_fullWR.gif" align="absmiddle"></span><span style="width:20px; height:17px;" class="pbf_s" id="speaker_on2"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></span></span></div> <div id="translations" onclick="javascript:GetFullWordCBK('2', 
'wordER');"> <ol> <li><span class="ref_result">мать<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info">f</span></li> <li><span class="ref_result">родительский элемент<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info">m</span><span class="ref_dictionary"> (ИТ - базовый) </span></li> <li><span class="ref_result">родительский<wrs><span class="sforms_src"></span></wrs></span><span class="ref_comment"> (attributive) </span> <span class="ref_info"></span><span class="ref_dictionary"> (ИТ - базовый) </span></li> <li><span class="ref_result">прототип<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info">m</span><span class="ref_dictionary"> (Политехнический) </span></li> <li><span class="ref_result">начало<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info">n</span><span class="ref_dictionary"> (Политехнический) </span></li> </ol> </div> </div><script> $('.sforms_src').filter(function(index) { return $(this).html().length == 0;}).remove();//getPrLink('mother ');//$('#speaker_on').unbind('click','ShowFullWRefERRE')//$('#speaker_on').click(function(){alert("не открывать окно расширеной справки");}); </script><div class="cforms_result" id="cforms_result3"> <div class="ref_cform" onclick="javascript:GetFullWordCBK('3', 'wordER');"><span class="fsform_link"><a href="javascript:;" onclick="javascript:GetFullWordCBK('3', 'wordER');"><img src="/images/common/owl_ico16.gif" width="19" height="19" border="0"></a><a href="javascript:;" onclick="javascript:GetFullWordCBK('3', 'wordER');"> Склонение </a></span><span class="ref_source">mother<wrs><span class="sforms_src"><span class="w_des">Positive</span><b>mother</b><br></span></wrs></span> <span class="ref_info"></span>, <span class="ref_psp">Прилагательное</span></div> <div class="tr_pr"><span class="transcription">[ˈmʌðə]</span><span class="pronunciation"><a href="javascript:;" class="pbf_s" id="lnkGtTr3" 
onclick="javascript:ListenWord(this,'mother',3,'play');"><img src="/images/common/vol_on.gif" align="absmiddle" border="0" id="imgGtTr3"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></a><span class="loadFrv" id="loadFrv3"><img hspace="10" src="/images/common/al_fullWR.gif" align="absmiddle"></span><span style="width:20px; height:17px;" class="pbf_s" id="speaker_on3"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></span></span></div> <div id="translations" onclick="javascript:GetFullWordCBK('3', 'wordER');"> <ol> <li><span class="ref_result">родительский<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info"></span><span class="ref_dictionary"> (ИТ - базовый) </span></li> </ol> </div> </div><script> $('.sforms_src').filter(function(index) { return $(this).html().length == 0;}).remove();//getPrLink('mother ');//$('#speaker_on').unbind('click','ShowFullWRefERRE')//$('#speaker_on').click(function(){alert("не открывать окно расширеной справки");}); </script><div id="fullRLink"><a href="javascript:GetFullWordCBK('1', 'wordER');">Показать полную словарную статью</a><span id="al_fullWR"><img src="/images/common/al_fullWR.gif" align="middle" hspace="10"> Загружаем...</span></div></div>
I want to extract the text that appears in this pattern: <span class="ref_result">TEXT<wrs>
I use this code to collect all the matches:
const string pattern = "ref_result\">\\w+<";
Regex rgx = new Regex(pattern, RegexOptions.Compiled);
var text = SantinizeOutput(result.result);
MatchCollection matches = rgx.Matches(text);
if (matches.Count > 0)
{
    resultsList = new List<string>(matches.Count);
    foreach (Match match in rgx.Matches(text))
    {
        string formattedWord = match.Value;
        int leftAngleBracketIndex = formattedWord.IndexOf(">");
        var word = formattedWord.Remove(0, leftAngleBracketIndex + 1);
        word = word.TrimEnd('<');
        resultsList.Add(word);
    }
}
private string SantinizeOutput(string input)
{
    var text = input.Replace("\n", "").Replace("\r", "");
    return Regex.Replace(text, "\\s+", " ");
}
The text contains 7 such matches, but the result only has 5.
Where did I make a mistake?
Upvotes: 2
Views: 156
Reputation: 11602
By changing your regex, you can also remove some logic in your code.
const string pattern = "ref_result\">([^<]*)";
Regex rgx = new Regex(pattern, RegexOptions.Compiled);
var text = SantinizeOutput(result.result);
MatchCollection matches = rgx.Matches(text);
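List<string> resultsList = new List<string>(matches.Count);
// Iterate over the matches, not the (still empty) result list.
for (int i = 0; i < matches.Count; i++) {
    // Group 1 holds the text captured between ref_result"> and the next <.
    resultsList.Add(matches[i].Groups[1].Value);
}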
private string SantinizeOutput(string input) {
    var text = input.Replace("\n", "").Replace("\r", "");
    return Regex.Replace(text, "\\s+", " ");
}
Upvotes: 0
Reputation: 179422
\w means 'word characters'; it does not match spaces or hyphens. Observe that two of the ref_result spans contain spaces:
<span class="ref_result">относиться по-матерински<wrs>
<span class="ref_result">родительский элемент<wrs>
Just use "ref_result\">[^<]+<wrs" to get all non-tag content.
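As a rough illustration of the difference, here is a minimal sketch that runs both patterns over a shortened sample built from three spans of the question's response. The class name RefResultDemo and the helper Extract are made up for this example, and a capture group is added to each pattern so the text can be read directly.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RefResultDemo
{
    // Hypothetical helper: returns capture group 1 of every match of pattern in html.
    public static List<string> Extract(string html, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Shortened sample: one single-word span and two spans containing a space.
        string html =
            "<span class=\"ref_result\">мать<wrs></wrs></span>" +
            "<span class=\"ref_result\">относиться по-матерински<wrs></wrs></span>" +
            "<span class=\"ref_result\">родительский элемент<wrs></wrs></span>";

        // \w+ stops at the first space, so the following '<' is never reached
        // for the two multi-word spans.
        Console.WriteLine(Extract(html, "ref_result\">(\\w+)<").Count);      // prints 1

        // [^<]+ accepts every character up to the next tag.
        Console.WriteLine(Extract(html, "ref_result\">([^<]+)<wrs").Count);  // prints 3
    }
}
```

The same effect explains the question's numbers: the two spans with spaces are the ones missing from the 7 expected results.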
Upvotes: 3
Reputation: 14233
Try changing your \w+ to .*?
So:
const string pattern = "ref_result\">.*?<";
.*? will get all characters (in a non-greedy way) until it hits the first < character.
.* will get all characters (in a greedy way) until it hits the last < character. You will want to use the non-greedy method.
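To make the greedy versus non-greedy difference concrete, here is a small sketch on a single span taken from the question's response. The class name GreedyDemo and the helper FirstMatch are invented for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: returns the full text of the first match, or "" if none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string html = "<span class=\"ref_result\">мать<wrs></wrs></span>";

        // Non-greedy: stops at the first '<' after the opening tag.
        Console.WriteLine(FirstMatch(html, "ref_result\">.*?<"));
        // prints: ref_result">мать<

        // Greedy: runs to the last '<' on the line, swallowing the tags in between.
        Console.WriteLine(FirstMatch(html, "ref_result\">.*<"));
        // prints: ref_result">мать<wrs></wrs><
    }
}
```

Because the sanitized response is a single line, the greedy form would stretch from the first ref_result to the very last < in the whole document, which is why the non-greedy form is the one to use here.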
Upvotes: 2