Reputation: 235
In each line I want to extract the text that comes right after the <li> tag:
<li>602 — <a href="/w/index.php?title=Text602&action=edit&redlink=1" class="new" title="Text602 (page does not exist)">Text602</a> document</li>
<li>ABW — <a href="/wiki/AbiWord" title="AbiWord">AbiWord</a> Document</li>
I want to get 602 from the first line and ABW from the second line. This is what I tried:
private void ParseFilesTypes()
{
    string[] lines = File.ReadAllLines(@"E:\New folder (44)\New Text Document.txt");
    foreach (string str in lines)
    {
        int r = str.IndexOf("<li>");
        if (r >= 0)
        {
            int i = str.IndexOf(" -", r + 1);
            if (i >= 0)
            {
                int c = str.IndexOf(" -", i + 1);
                if (c >= 0)
                {
                    i++;
                    MessageBox.Show(str.Substring(i, c - i));
                }
            }
        }
    }
}
But c is always -1.
Upvotes: 1
Views: 79
Reputation: 30022
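The problem is that you're reading the file with the wrong encoding. The separator in your file is an em dash (—), not a hyphen (-), so the search string in your code has to use that character and the file has to be read with the encoding it was saved in. If you inspect a line in the debugger after reading it with the wrong encoding, you'll see a black diamond (the replacement character) where the — should be.
Also, the second IndexOf starts at i + 1, which is already past the space, so it can never find " —" again. Either remove the space before the — in that search string or replace i + 1 with i: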
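// Requires: using System.IO; using System.Text; using System.Windows.Forms;
private static void ParseFilesTypes()
{
    string sampleFilePath = @"log.txt";
    // Read with the encoding the file was saved in; with the default (UTF-8)
    // the em dash byte is decoded as the replacement character.
    // On .NET Core / .NET 5+, call
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
    string[] lines = File.ReadAllLines(sampleFilePath, Encoding.GetEncoding("windows-1252"));
    foreach (string str in lines)
    {
        int r = str.IndexOf("<li>");
        if (r >= 0)
        {
            int i = str.IndexOf(" —", r + 1);
            if (i >= 0)
            {
                // Search from i, not i + 1, so the same separator is found (c == i).
                int c = str.IndexOf(" —", i);
                if (c >= 0)
                {
                    // The wanted text sits between the end of "<li>" and the separator.
                    int startIndex = r + "<li>".Length;
                    int length = c - startIndex;
                    string result = str.Substring(startIndex, length);
                    MessageBox.Show(result);
                }
            }
        }
    }
}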
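To see the encoding effect in isolation, here is a minimal console sketch. It assumes the file stores the dash as the single byte 0x97 (the em dash in Windows-1252); the class and method names are made up for the illustration, and the byte array stands in for a line read from disk.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: returns the Windows-1252 encoding.
    public static Encoding Windows1252()
    {
        // Needed on .NET Core / .NET 5+, where code page encodings
        // are not registered by default.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Decodes the bytes and returns the index of " —" (space + U+2014), or -1.
    public static int IndexOfSeparator(byte[] raw, Encoding encoding)
    {
        string text = encoding.GetString(raw);
        return text.IndexOf(" \u2014", StringComparison.Ordinal);
    }

    static void Main()
    {
        // "602 " followed by byte 0x97 (assumed on-disk form of the dash).
        byte[] raw = { 0x36, 0x30, 0x32, 0x20, 0x97 };

        // 0x97 alone is not valid UTF-8, so it decodes to U+FFFD.
        Console.WriteLine(IndexOfSeparator(raw, Encoding.UTF8));   // -1
        // In Windows-1252 the same byte is the em dash.
        Console.WriteLine(IndexOfSeparator(raw, Windows1252()));   // 3
    }
}
```

The ordinal comparison is used so the result does not depend on the current culture.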
Upvotes: 2
Reputation: 3373
I think this is a case where a regex would be useful (as long as the <li> tags have no attributes):
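// Requires: using System.Text.RegularExpressions;
// lines is the string[] read from the file, as in the question.
// (.+?) is lazy, so the capture stops at the first " —" on the line.
var regex = new Regex("^<li>(.+?) —");
foreach (string str in lines)
{
    var m = regex.Match(str);
    if (m.Success)
        MessageBox.Show(m.Groups[1].Value);
}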
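If the <li> tags might carry attributes or leading whitespace after all, the pattern can be loosened. The following is only a sketch under that assumption; LiParser and ExtractPrefix are hypothetical names, and the lines must already have been read with the correct encoding so that the dash is a real U+2014.

```csharp
using System;
using System.Text.RegularExpressions;

static class LiParser
{
    // <li\b[^>]*> accepts "<li>" as well as "<li class=...>" (but not "<link>").
    // (.+?) lazily captures the text up to the first whitespace + em dash.
    private static readonly Regex Pattern =
        new Regex(@"^\s*<li\b[^>]*>(.+?)\s+\u2014");

    // Returns the text between the <li> tag and the dash, or null if no match.
    public static string ExtractPrefix(string line)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LiParser.ExtractPrefix(
            "<li>602 \u2014 <a href=\"#\">Text602</a> document</li>"));          // 602
        Console.WriteLine(LiParser.ExtractPrefix(
            "<li class=\"x\">ABW \u2014 <a href=\"#\">AbiWord</a> Document</li>")); // ABW
    }
}
```

For anything beyond simple, line-oriented input like this, an HTML parser is usually the more robust choice.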
Upvotes: 2