Reputation:
I have an HTML file that I'm trying to extract data from. The regex I'm using is:
"<tr.+?>.+?<td class=\"table_row_col2\"><b>(.+?)&.+?</b>.+?<td class=\"table_row_col5\">(.+?)</td>.+?<td class=\"table_row_col6\">(.+?)</td>.+?</tr>"
It works in Python but not in C#. Here's some sample data:
<tr class="table_row" style="background-color: #d3d3d3;">
<td class="table_row_col1">271</td>
<td class="table_row_col2"><b>16/09/2015 05:28 PM</b></font></small></sup></td>
<td class="table_row_col3"><span style="color:#e30613">14.3</span></td>
<td class="table_row_col4">-</td>
<td class="table_row_col5">8</td>
<td class="table_row_col6">-</td>
<td class="table_row_col7">-</td>
<td class="table_row_col8">Before dinner</td>
<td class="table_row_col9">-</td>
<td class="table_row_col10">-</td>
<td class="table_row_col11">-</td>
</tr>
<tr class="table_row" style="background-color: #ffffff;">
<td class="table_row_col1">272</td>
<td class="table_row_col2"><b>16/09/2015 02:54 PM</b></font></small></sup></td>
<td class="table_row_col3"><span style="color:#e30613">17.6</span></td>
<td class="table_row_col4">-</td>
<td class="table_row_col5">20</td>
<td class="table_row_col6">32</td>
<td class="table_row_col7">-</td>
<td class="table_row_col8">Other</td>
<td class="table_row_col9">-</td>
<td class="table_row_col10">-</td>
<td class="table_row_col11">-</td>
</tr>
<tr class="table_row" style="background-color: #d3d3d3;">
<td class="table_row_col1">273</td>
<td class="table_row_col2"><b>15/09/2015 11:09 PM</b></font></small></sup></td>
<td class="table_row_col3">-</td>
<td class="table_row_col4">-</td>
<td class="table_row_col5">-</td>
<td class="table_row_col6">34</td>
<td class="table_row_col7">-</td>
<td class="table_row_col8">Before Bed</td>
<td class="table_row_col9">-</td>
<td class="table_row_col10">-</td>
<td class="table_row_col11">-</td>
</tr>
I'm trying to extract the date from table_row_col2 and the numbers from table_row_col5 and table_row_col6.
Upvotes: 1
Views: 210
Reputation: 2107
If you know the HTML structure never changes, you can skip the regex and use plain string splitting by adding a small helper class, Split:
// Each entry in rows is the text of one <tr class="table_row" ...> element.
List<string> rows = Split.Extract(htmlString, "class=\"table_row\"", "</tr>");
foreach (string row in rows)
{
    // [0] takes the first (and only expected) match inside the row.
    string col2 = Split.Extract(row, "class=\"table_row_col2\"><b>", "</b>")[0];
    string col5 = Split.Extract(row, "class=\"table_row_col5\">", "</td>")[0];
    string col6 = Split.Extract(row, "class=\"table_row_col6\">", "</td>")[0];
    Console.WriteLine(col2 + ", " + col5 + ", " + col6);
}
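The loop above leaves the three cells as raw strings. If typed values are needed, a minimal sketch could look like the following. CellParser and its method names are hypothetical, the date format is only inferred from the sample rows (day/month/year with a 12-hour clock), and the numbers in columns 5 and 6 are assumed to be whole numbers, with "-" meaning no value:

```csharp
using System;
using System.Globalization;

public static class CellParser
{
    // Assumed format, inferred from the sample data: dd/MM/yyyy hh:mm AM/PM.
    public static DateTime ParseDate(string text)
    {
        return DateTime.ParseExact(
            text.Trim(), "dd/MM/yyyy hh:mm tt", CultureInfo.InvariantCulture);
    }

    // Returns null for "-" or any other text that is not a whole number.
    public static int? ParseCount(string text)
    {
        int value;
        bool ok = int.TryParse(
            text.Trim(), NumberStyles.Integer, CultureInfo.InvariantCulture, out value);
        return ok ? value : (int?)null;
    }
}
```

Inside the loop this would be called as CellParser.ParseDate(col2) and CellParser.ParseCount(col5). ParseDate throws a FormatException if the real page uses a different date layout, so adjust the format string to match.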
The additional Split class:
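using System;
using System.Collections.Generic;

public class Split
{
    // Returns every substring of source that lies between an occurrence of
    // splitStart and the next occurrence of splitEnd.
    public static List<string> Extract(string source, string splitStart, string splitEnd)
    {
        var results = new List<string>();
        string[] start = { splitStart };
        string[] end = { splitEnd };

        // temp[0] is the text before the first start marker, so skip it.
        string[] temp = source.Split(start, StringSplitOptions.None);
        for (int i = 1; i < temp.Length; i++)
        {
            // Keep only the part before the first end marker.
            results.Add(temp[i].Split(end, StringSplitOptions.None)[0]);
        }
        // No try/catch here: rethrowing as new Exception(e.Message) would
        // only discard the original exception type and stack trace.
        return results;
    }
}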
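If you would rather keep a regex, here is a rough sketch of that route. One likely reason the original pattern fails in C# is that "." does not match a newline by default, so a single match cannot span a multi-line row. The .NET counterpart of Python's re.DOTALL is RegexOptions.Singleline (this assumes the Python version used that flag, which the question does not say). The class name RowRegex is made up for illustration. The sample as posted contains no "&" inside the <b> tag, so the "&.+?" part of the original pattern is dropped here; put it back if the real page has an entity after the date.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowRegex
{
    // Singleline makes '.' match '\n' as well, so one match can cover a whole <tr>.
    private static readonly Regex RowPattern = new Regex(
        "<tr[^>]*>.*?<td class=\"table_row_col2\"><b>(.+?)</b>" +
        ".*?<td class=\"table_row_col5\">(.*?)</td>" +
        ".*?<td class=\"table_row_col6\">(.*?)</td>.*?</tr>",
        RegexOptions.Singleline);

    // Returns one string[] per row: { date, col5, col6 }.
    public static List<string[]> ExtractRows(string html)
    {
        var rows = new List<string[]>();
        foreach (Match m in RowPattern.Matches(html))
        {
            rows.Add(new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value });
        }
        return rows;
    }
}
```

Call it as RowRegex.ExtractRows(htmlString). Like the Split approach, this relies on every row containing all three cells in the same order; if a row is missing one, the lazy ".*?" can run on into the next row, and an HTML parser would be the safer choice.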
Upvotes: 1