Yorak Hunt

Reputation: 125

Regex to extract info out of large html source?

In among a lot of HTML source I have some elements like this:

<option value=15>Bahrain - Manama</option>
<option value=73>Bangladesh - Dhaka</option>
<option value=46>Barbados - Bridgetown</option>
<option value=285>Belarus - Minsk</option>
<option value=48>Belgium - Brussels</option>
<option value=36>Belize - Belmopan</option>

Also I have a dictionary declared like Dictionary<string, int> Places = new Dictionary<string, int>();

What I want to do is extract the city name from the HTML and use it as the string key in Places, and extract the number code and use it as the int value. For the first one I would call Places.Add("Manama", 15); The country name can be ignored. The idea is to scan the HTML source and add the cities automatically.

This is what I have so far:

string[] temp = htmlContent.Split('\n');
List<string> temp2 = new List<string>();
foreach (string s in temp)
{
    if (s.Contains("<option value="))
    {
        string t = s.Replace("option value=", ""); 
        temp2.Add(t); 
    }
}

This cuts out some of the text but then I more or less get stuck wondering how to extract the relevant parts from the text. It's really bad I know but I'm learning :(

Upvotes: 0

Views: 126

Answers (2)

BrokenGlass

Reputation: 160852

Don't use a regular expression; use HtmlAgilityPack. You can then use LINQ to retrieve your option elements and build up your dictionary in a single statement:

// requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
//remove "option" special handling otherwise inner text won't be parsed correctly
HtmlNode.ElementsFlags.Remove("option"); 
doc.Load("test.html");

// key: city name (text after the '-'), value: the option's numeric code
var Places = doc.DocumentNode
                .Descendants("option")
                .ToDictionary(x => x.InnerText.Split('-')[1].Trim(),
                              x => int.Parse(x.Attributes["value"].Value));

To extract the city name from the option's inner text, the above uses string.Split(), splitting on the separating -, taking the second (city) string and trimming any leading or trailing whitespace.
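
As a rough illustration of that split step in isolation, the sketch below applies it to the text of a single option. The class and helper names (OptionText, CityFromOptionText) are hypothetical and not part of HtmlAgilityPack. It also splits on " - " (with the surrounding spaces) rather than a bare '-', on the assumption that a place name may itself contain a hyphen: with "Timor-Leste - Dili", Split('-')[1] would return "Leste".

```csharp
using System;

public static class OptionText
{
    // Hypothetical helper: returns the text after the last " - " separator.
    public static string CityFromOptionText(string innerText)
    {
        string[] parts = innerText.Split(new[] { " - " }, StringSplitOptions.None);
        return parts[parts.Length - 1].Trim();
    }

    public static void Main()
    {
        Console.WriteLine(CityFromOptionText("Bahrain - Manama")); // prints "Manama"
    }
}
```

If the sample data is representative and no name contains a hyphen, the simpler Split('-') is sufficient.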

Upvotes: 4

Ryan Durrant

Reputation: 1018

If the only relevant data you are looking for is within the option elements, you can split the source on the opening tag:

string[] options = Regex.Split(theSource, "<option value="); // Splits up the source which is downloaded from the url

That way you get an array of strings, each starting with the characters of your int (apart from options[0], which holds whatever came before the first option). If the ints are always at least 10, i.e. at least 2 characters long, you can use:

// x is the index of the current element of options (1 or higher)
int y = 2; // index of the next character to examine
string theString = options[x].Substring(0, 2); // valid only if every number has at least two digits
while (options[x].Substring(y, 1) != ">") // keep reading until the number has finished
{
    theString += options[x].Substring(y, 1);
    y++;
}
int theInt = int.Parse(theString);

The while loop keeps appending characters until it reaches the closing >, so longer numbers are handled too. If the numbers are not always at least 10, start with y = 0 and an empty theString and let the loop read every digit.

Then I would re-use the string theString:

string[] place = Regex.Split(options[x], " - "); // split between the country and the city
theString = place[1].Substring(0, place[1].IndexOf("</option>")); // the city is everything before the closing tag

And then add them with

Places.Add(theString, theInt);

This should work. If the code doesn't work straight away, the algorithm will; just make sure the spelling is right and that the variables are doing what they should.
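
For comparison, here is a minimal sketch that collects both values in one pass. It is a variation rather than the code above: it uses a single regular expression with capture groups instead of Regex.Split and index arithmetic. The class and method names (PlaceParser, Parse) are hypothetical, and the pattern assumes the markup looks exactly like the sample in the question (unquoted numeric value, "Country - City" text, no nested tags).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceParser
{
    // Hypothetical helper: maps each city name to its numeric option value.
    public static Dictionary<string, int> Parse(string html)
    {
        var places = new Dictionary<string, int>();
        // Group 1: the digits of the value; group 2: the text after the last " - ".
        var pattern = new Regex(@"<option value=(\d+)>[^<]* - ([^<]+)</option>");
        foreach (Match m in pattern.Matches(html))
        {
            // The indexer overwrites a repeated city instead of throwing as Add would.
            places[m.Groups[2].Value.Trim()] = int.Parse(m.Groups[1].Value);
        }
        return places;
    }

    public static void Main()
    {
        string html = "<option value=15>Bahrain - Manama</option>\n" +
                      "<option value=285>Belarus - Minsk</option>";
        foreach (var pair in Parse(html))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

A pattern like this is brittle if the markup varies (quoted attributes, extra attributes, different spacing), which is where a real HTML parser is the safer choice.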

Upvotes: 0
