user1175245

Reputation: 11

C# regex data from website

I am trying to make an add-on for a game named Tibia.

On their website, Tibia.com, you can look up characters and see their deaths.

For example:

http://www.tibia.com/community/?subtopic=characters&name=Kixus

Now I want to read the deaths data by using Regex in my C# application.

But I cannot work it out, even after spending hours on

http://myregextester.com/index.php

The expression I use is:

<tr bgcolor=(?:"#D4C0A1"|"#F1E0C6") ><td width="25%" valign="top" >(.*?)?#160;CET</td><td>((?:Died|Killed) at Level ([^ ]*)|and) by (?:<[^>]*>)?([^<]*).</td></tr>

But I cannot make it work.

I want to capture the timestamp, the level, and the name of the creature or player that made the kill.

Thanks in advance.

-Regards

Upvotes: 1

Views: 608

Answers (4)

sga101

Reputation: 1904

As suggested by Joe White, you would have a much more robust implementation if you use an HTML parser for this task. There is plenty of support for this on StackOverflow: see here for example.

If you really have to use regexes

I would recommend breaking your solution down into simpler regexes that can be applied in a top-down parsing approach to get the results.

For example:

  1. use a regex on the whole page which matches the character table

    I would suggest matching the shortest unique string before and after the table rather than the table itself, and capturing the table using a group, since this avoids having to deal with the possibility of nested tables.

  2. use a regex on the character table that matches table rows

  3. use a regex on the first cell to match the date
  4. use a regex on the second cell to match links
  5. use a regex on the second cell to match the player's level
  6. use a regex on the second cell to match the killer's name if it was a creature (there are no links in the cell)

This will be much more maintainable if the site changes its HTML structure significantly, because only the regex for the affected step needs to be updated.
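The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the markup is guessed from the regex in the question (a "Character Deaths" title followed by two-cell rows), and the names `DeathRow`, `DeathTableRegexSketch`, and `Clean` are hypothetical. Step 3 is reduced to decoding the first cell, since that cell is assumed to hold nothing but the date.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed row (not part of any library).
public class DeathRow
{
    public string Date;
    public int Level;
    public List<string> KilledBy = new List<string>();
}

public static class DeathTableRegexSketch
{
    const RegexOptions Opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Step 1: capture what lies between the table title and the next </table>.
    static readonly Regex TableRegex = new Regex(@"Character Deaths(.*?)</table>", Opts);
    // Step 2: one match per two-cell row; group 1 = date cell, group 2 = details cell.
    static readonly Regex RowRegex = new Regex(
        @"<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>\s*</tr>", Opts);
    // Step 4: link text inside the details cell (killers that are players).
    static readonly Regex LinkRegex = new Regex(@"<a[^>]*>(.*?)</a>", Opts);
    // Step 5: the level of the player who died.
    static readonly Regex LevelRegex = new Regex(@"\blevel\s+([0-9]+)", Opts);
    // Step 6: creature name when the cell has no links; drops the trailing period.
    static readonly Regex CreatureRegex = new Regex(@"\bby\s+(.+?)\.?\s*$", Opts);

    // Decode entities and turn non-breaking spaces (&#160;) into plain spaces.
    static string Clean(string s)
    {
        return WebUtility.HtmlDecode(s).Replace('\u00A0', ' ').Trim();
    }

    public static List<DeathRow> Parse(string html)
    {
        var rows = new List<DeathRow>();
        Match table = TableRegex.Match(html);
        if (!table.Success) return rows; // no deaths table on the page

        foreach (Match row in RowRegex.Matches(table.Groups[1].Value))
        {
            // Step 3: the first cell is assumed to contain only the date.
            var death = new DeathRow { Date = Clean(row.Groups[1].Value) };
            string cell = row.Groups[2].Value;

            Match level = LevelRegex.Match(cell);
            if (level.Success) death.Level = int.Parse(level.Groups[1].Value);

            MatchCollection links = LinkRegex.Matches(cell);
            foreach (Match link in links) death.KilledBy.Add(Clean(link.Groups[1].Value));

            if (links.Count == 0)
            {
                Match creature = CreatureRegex.Match(cell);
                if (creature.Success) death.KilledBy.Add(Clean(creature.Groups[1].Value));
            }
            rows.Add(death);
        }
        return rows;
    }
}
```

Each regex stays short enough to test on its own, which is the main point of the top-down approach. If the real page differs from the assumed markup, only the affected pattern should need adjusting.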

A complete working implementation using HtmlAgilityPack

You can download the library from the HtmlAgilityPack site on CodePlex.

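using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// This class is used to represent the extracted details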
public class DeathDetails
{
    public DeathDetails()
    {
        this.KilledBy = new List<string>();
    }

    public string DeathDate { get; set; }
    public List<String> KilledBy { get; set; }
    public int PlayerLevel { get; set; }
}

public class CharacterPageParser
{
    public string CharacterName { get; private set; }

    public CharacterPageParser(string characterName)
    {
        this.CharacterName = characterName;
    }

    public List<DeathDetails> GetDetails()
    {
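        // escape the name so that spaces and special characters survive in the query string
        string url = "http://www.tibia.com/community/?subtopic=characters&name=" + System.Uri.EscapeDataString(this.CharacterName);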
        string content = GetContent(url);

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(content);

        HtmlNodeCollection tables = document.DocumentNode.SelectNodes("//div[@id='characters']//table");

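        List<DeathDetails> deaths = new List<DeathDetails>();

        // SelectNodes returns null when nothing matches, and the deaths table is
        // missing when the character has no recorded deaths
        HtmlNode table = tables == null ? null : GetCharacterDeathsTable(tables);
        if (table == null)
        {
            return deaths;
        }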

        for (int i = 1; i < table.ChildNodes.Count; i++)
        {
            DeathDetails details = BuildDeathDetails(table, i);
            deaths.Add(details);
        }
        return deaths;
    }

    private static string GetContent(string url)
    {
        using (System.Net.WebClient c = new System.Net.WebClient())
        {
            string content = c.DownloadString(url);
            return content;
        }
    }

    private static DeathDetails BuildDeathDetails(HtmlNode table, int i)
    {
        DeathDetails details = new DeathDetails();

        HtmlNode tableRow = table.ChildNodes[i];

        //every row should have two cells in it
        if (tableRow.ChildNodes.Count != 2)
        {
            throw new Exception("Html format may have changed");
        }

        HtmlNode deathDateCell = tableRow.ChildNodes[0];
        details.DeathDate = System.Net.WebUtility.HtmlDecode(deathDateCell.InnerText);

        HtmlNode deathDetailsCell = tableRow.ChildNodes[1];
        // get inner text to parse for player level and or creature name
        string deathDetails = System.Net.WebUtility.HtmlDecode(deathDetailsCell.InnerText);

        // get player level using regex
        Match playerLevelMatch = Regex.Match(deathDetails, @" level ([\d]+) ", RegexOptions.IgnoreCase);
        int playerLevel = 0;
        if (int.TryParse(playerLevelMatch.Groups[1].Value, out playerLevel))
        {
            details.PlayerLevel = playerLevel;
        }

        if (deathDetailsCell.ChildNodes.Count > 1)
        {
            // death details contains links which we can parse for character names

            foreach (HtmlNode link in deathDetailsCell.ChildNodes)
            {
                if (link.OriginalName == "a")
                {
                    string characterName = System.Net.WebUtility.HtmlDecode(link.InnerText);
                    details.KilledBy.Add(characterName);
                }
            }
        }
        else
        {
            // player was killed by a creature - capture creature name
            Match creatureMatch = Regex.Match(deathDetails, " by (.*)", RegexOptions.IgnoreCase);
            string creatureName = creatureMatch.Groups[1].Value;
            details.KilledBy.Add(creatureName);
        }
        return details;
    }

    private static HtmlNode GetCharacterDeathsTable(HtmlNodeCollection tables)
    {
        foreach (HtmlNode table in tables)
        {
            // Get first row
            HtmlNode tableRow = table.ChildNodes[0];

            // check to see if contains enough elements
            if (tableRow.ChildNodes.Count == 1)
            {
                HtmlNode tableCell = tableRow.ChildNodes[0];
                string title = tableCell.InnerText;

                // skip this table if it doesn't have the right title
                if (title == "Character Deaths")
                {
                    return table;
                }
            }
        }

        return null;
    }
}

And an example of it in use:

CharacterPageParser kixusParser = new CharacterPageParser("Kixus");

foreach (DeathDetails details in kixusParser.GetDetails())
{
    Console.WriteLine("Player at level {0} was killed on {1} by {2}", details.PlayerLevel, details.DeathDate, string.Join(",", details.KilledBy));
}

Upvotes: 1

Joe White

Reputation: 97808

It's a bad idea to use regular expressions to parse HTML. They're a very poor tool for the job. If you're parsing HTML, use an HTML parser.

For .NET, the usual recommendation is to use the HTML Agility Pack.

Upvotes: 2

Royi Namir

Reputation: 148694

Try this:

http://jsbin.com/atupok/edit#javascript,html

and continue from there. I did most of the work here :)

Edit:

http://jsbin.com/atupok/3/edit

Also, start using this tool

http://regexr.com?2vrmf

instead of the one you have.

Upvotes: 0

Andrii Kalytiiuk

Reputation: 1497

You can also use the Espresso tool to work out the proper regular expression.

To escape all special characters in text that should be matched literally rather than interpreted as regular expression syntax, you can use the Regex.Escape method:

string escapedText = Regex.Escape("<td width=\"25%\" valign=\"top\" >");
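As a minimal sketch of how the escaped text might be used, the literal opening and closing tags can be escaped separately and joined around a capture group. The class and method names here (`EscapeSketch`, `ExtractFirstCell`) are hypothetical, and the cell markup is taken from the expression in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeSketch
{
    // Returns the content of the first cell that has the given literal opening tag,
    // or null when no such cell is found.
    public static string ExtractFirstCell(string html)
    {
        // Escape the literal parts so that none of their characters act as regex syntax.
        string open = Regex.Escape("<td width=\"25%\" valign=\"top\" >");
        string close = Regex.Escape("</td>");

        // Only the capture group in the middle is real regex syntax.
        Match m = Regex.Match(html, open + "(.*?)" + close, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Only the hand-written `(.*?)` part is left unescaped, so the pattern keeps working even if the literal markup later contains characters such as `.`, `?` or `#`.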

Upvotes: 0
