Reputation:
I have data like this:
<td><a href="/New_York_City" title="New York City">New York</a></td>
I would like to extract New York from it.
I don't have any skill in regex whatsoever, but I have tried this:
string pattern = "<td>.*</td>";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Regex r1 = new Regex("<a .*>.*</a>", RegexOptions.IgnoreCase);
// the using block closes and disposes the reader, even if an exception is thrown
using (StreamReader sr = new StreamReader("c:\\USAcityfile2.txt"))
{
    string read = "";
    while ((read = sr.ReadLine()) != null)
    {
        foreach (Match m in r.Matches(read))
        {
            foreach (Match m1 in r1.Matches(m.Value.ToString()))
                Console.WriteLine(m1.Value);
        }
    }
}
This gave me <a href="/New_York_City" title="New York City">New York</a>.
How can I get the text between <a .*> and </a>? Thanks.
Upvotes: 0
Views: 216
Reputation: 114721
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
// Descendants(name) filters by element name
var nodes = doc.DocumentNode.Descendants("a");
// or, with XPath (SelectNodes returns null when nothing matches):
// var nodes = doc.DocumentNode.SelectNodes("//td/a") ?? new HtmlNodeCollection(null);
foreach (var node in nodes)
{
    string city = node.InnerText;
}
// or var linkTexts = nodes.Select(node => node.InnerText);
Upvotes: 0
Reputation: 56172
As per the OP's comment, the input document is HTML, so it would be better to use an HTML parser, e.g. Html Agility Pack. You can use the XPath //td/a to obtain the desired result.
Upvotes: 0
Reputation: 92986
If you insist on a regex for this particular case, then try this:
String pattern = @"(?<=<a[^>]*>).*?(?=</a>)";
(?<=<a[^>]*>) is a positive lookbehind assertion that ensures there is <a[^>]*> before the wanted pattern.
(?=</a>) is a positive lookahead assertion that ensures there is </a> after the pattern.
.*? uses a lazy quantifier, matching as little as possible up to the first </a>.
A good reference for regular expressions is regular-expressions.info
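To see the pattern in action, here is a minimal sketch that runs it against the sample line from the question. The class name AnchorText and the helper Extract are made up for illustration; only the pattern itself comes from this answer. It relies on .NET allowing variable-length lookbehind, which many other regex flavors do not.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AnchorText
{
    // The lookbehind requires an opening <a ...> tag right before the match and
    // the lookahead requires </a> right after it; neither is part of the match.
    static readonly Regex Pattern =
        new Regex(@"(?<=<a[^>]*>).*?(?=</a>)", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the text of every <a> element on one line.
    public static List<string> Extract(string line)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(line))
            result.Add(m.Value);
        return result;
    }

    static void Main()
    {
        string line = "<td><a href=\"/New_York_City\" title=\"New York City\">New York</a></td>";
        foreach (string city in Extract(line))
            Console.WriteLine(city); // prints "New York"
    }
}
```

Because the assertions are zero-width, m.Value is already the bare link text and no capture group is needed.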
Upvotes: 1
Reputation: 12226
var g = Regex.Match(s, @"\<a[^>]+\>([^<]*)").Groups[1];
To find all values of <a> in your file you may use the following (easier) code:
var allValuesOfAnchorTag =
from line in File.ReadLines(filename)
from match in Regex.Matches(line, @"\<a[^>]+\>([^<]*)").OfType<Match>()
let @group = match.Groups[1]
where @group.Success
select @group.Value;
However, you seem to be working with XML, as @kirill-polishchuk correctly pointed out. If that is true (XElement.Load needs a well-formed document with a single root element), the code is even simpler:
var values = from e in XElement.Load(filename).Descendants("a")
select e.Value;
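For a quick check without a file, the same query can be run on an in-memory string with XElement.Parse. This is only a sketch: the class name XmlAnchors, the helper Values, and the wrapping <tr> element are assumptions for illustration, and it works only when the markup happens to be well-formed XML, which real-world HTML often is not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XmlAnchors
{
    // Hypothetical helper: parses the markup as XML and returns the text
    // content of every <a> element below the root.
    public static List<string> Values(string xml)
    {
        return XElement.Parse(xml)
                       .Descendants("a")
                       .Select(e => e.Value)
                       .ToList();
    }

    static void Main()
    {
        // The <tr> wrapper is added here so the sample has a single root element.
        string xml = "<tr><td><a href=\"/New_York_City\" title=\"New York City\">New York</a></td></tr>";
        foreach (string value in Values(xml))
            Console.WriteLine(value); // prints "New York"
    }
}
```

If the input is not well-formed, XElement.Parse throws an XmlException, in which case an HTML parser is the safer choice.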
Upvotes: 0
Reputation:
foreach (Match m1 in r1.Matches(m.Value.ToString()))
{
//Console.WriteLine(m1.Value);
string[] res = m1.Value.Split(new char[] {'>','<'});
Console.WriteLine(res[2]);
}
This did the trick for this particular example, but it is still not what I am looking for.
Upvotes: 0