user5852009
How do I find specific matches using regex and put them in a string array?

I have an HTML file that I'm trying to extract data from. The regex I'm using is:

"<tr.+?>.+?<td class=\"table_row_col2\"><b>(.+?)&.+?</b>.+?<td class=\"table_row_col5\">(.+?)</td>.+?<td class=\"table_row_col6\">(.+?)</td>.+?</tr>"

It works in Python but not in C#. Here's some sample data:

<tr class="table_row" style="background-color: #d3d3d3;">
    <td class="table_row_col1">271</td>
    <td class="table_row_col2"><b>16/09/2015&nbsp;05:28&nbsp;PM</b></font></small></sup></td>
    <td class="table_row_col3"><span style="color:#e30613">14.3</span></td>
    <td class="table_row_col4">-</td>
    <td class="table_row_col5">8</td>
    <td class="table_row_col6">-</td>
    <td class="table_row_col7">-</td>
    <td class="table_row_col8">Before dinner</td>
    <td class="table_row_col9">-</td>
    <td class="table_row_col10">-</td>
    <td class="table_row_col11">-</td>
</tr>

<tr class="table_row" style="background-color: #ffffff;">
    <td class="table_row_col1">272</td>
    <td class="table_row_col2"><b>16/09/2015&nbsp;02:54&nbsp;PM</b></font></small></sup></td>
    <td class="table_row_col3"><span style="color:#e30613">17.6</span></td>
    <td class="table_row_col4">-</td>
    <td class="table_row_col5">20</td>
    <td class="table_row_col6">32</td>
    <td class="table_row_col7">-</td>
    <td class="table_row_col8">Other</td>
    <td class="table_row_col9">-</td>
    <td class="table_row_col10">-</td>
    <td class="table_row_col11">-</td>
</tr>

<tr class="table_row" style="background-color: #d3d3d3;">
    <td class="table_row_col1">273</td>
    <td class="table_row_col2"><b>15/09/2015&nbsp;11:09&nbsp;PM</b></font></small></sup></td>
    <td class="table_row_col3">-</td>
    <td class="table_row_col4">-</td>
    <td class="table_row_col5">-</td>
    <td class="table_row_col6">34</td>
    <td class="table_row_col7">-</td>
    <td class="table_row_col8">Before Bed</td>
    <td class="table_row_col9">-</td>
    <td class="table_row_col10">-</td>
    <td class="table_row_col11">-</td>
</tr>

I'm trying to extract the date from table_row_col2 and the numbers from table_row_col5 and table_row_col6.

Answers (1)

M. Schena

If you know the HTML structure never changes, you can skip the regex and use plain string splitting with a small helper class, Split:

List<string> rows = Split.Extract(htmlString, "class=\"table_row\"", "</tr>");
foreach (string row in rows)
{
    // Stop at the first "&" so only the date is kept, not the "&nbsp;"-separated time.
    string col2 = Split.Extract(row, "class=\"table_row_col2\"><b>", "&")[0];
    string col5 = Split.Extract(row, "class=\"table_row_col5\">", "</td>")[0];
    string col6 = Split.Extract(row, "class=\"table_row_col6\">", "</td>")[0];

    Console.WriteLine(col2 + ", " + col5 + ", " + col6);
}

The Split helper class:

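using System;
using System.Collections.Generic;

public class Split
{
    // Returns every substring of source that sits between splitStart and the next splitEnd.
    public static List<string> Extract(string source, string splitStart, string splitEnd)
    {
        var results = new List<string>();

        string[] start = new string[] { splitStart };
        string[] end = new string[] { splitEnd };

        // temp[0] is the text before the first splitStart, so the loop starts at 1.
        string[] temp = source.Split(start, StringSplitOptions.None);

        for (int i = 1; i < temp.Length; i++)
        {
            // Keep only the part before the first splitEnd.
            results.Add(temp[i].Split(end, StringSplitOptions.None)[0]);
        }

        // No try/catch here: rethrowing as new Exception(e.Message) would only
        // discard the original exception type and stack trace.
        return results;
    }
}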
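
If you would rather keep the regex from the question, one likely reason it fails in C# is that `.` does not match a newline by default, so `.+?` cannot cross from one line of a `<tr>` to the next. Presumably the Python version was run with `re.DOTALL`; the .NET counterpart is `RegexOptions.Singleline`. Below is a minimal sketch under that assumption. The names `RowExtractor` and `Extract` are made up for illustration, and the pattern is the one from the question, rewritten as a verbatim string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RowExtractor
{
    // Same pattern as in the question. Inside @"..." a double quote is written as "".
    private const string Pattern =
        @"<tr.+?>.+?<td class=""table_row_col2""><b>(.+?)&.+?</b>" +
        @".+?<td class=""table_row_col5"">(.+?)</td>" +
        @".+?<td class=""table_row_col6"">(.+?)</td>.+?</tr>";

    // Returns one string array per matched row: { date, col5, col6 }.
    public static string[][] Extract(string html)
    {
        // Singleline makes '.' match '\n' as well, so one match can span a whole <tr>.
        return Regex.Matches(html, Pattern, RegexOptions.Singleline)
            .Cast<Match>()
            .Select(m => new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value })
            .ToArray();
    }
}
```

Calling `RowExtractor.Extract(htmlString)` on the sample data should give three arrays, which you can print with `Console.WriteLine(string.Join(", ", row))`. As with the splitting approach, this only holds while the markup keeps exactly this shape.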
