atari400

Reputation: 91

Get text from an HTML web document in .NET

Basically, what I want to do is get text from an HTML web document:

<a href="showthread.php?tid=2632829">1</a> 
<a href="showthread.php?tid=2342818">1</a> 
<a href="showthread.php?tid=2342818">1</a> 
<a href="showthread.php?tid=2342818">1</a> 
....
....

All these links are on different lines, with a lot of other script in between them. The catch is that I want to search for "1</a>" in the document and get the link

showthread.php?tid=11digitnumber 

I then want to place them in a RichTextBox, one per line, like this:

    showthread.php?tid=11digitnumber
    showthread.php?tid=11digitnumber
    showthread.php?tid=11digitnumber

...

So far I have retrieved the source of the web page using

source = WebBrowser1.DocumentText.ToString()

Earlier I had some luck using

Dim ss, variable As String

variable = ss.Substring(ss.LastIndexOfAny(">1</a> ") - 27, 27)

Output:

showthread.php?tid=11digitnumber

but this only retrieves one link, and there are many such links in the document.

Upvotes: 1

Views: 686

Answers (1)

Jorge Alvarado

Reputation: 2674

You just have to play with a bit of logic, like:

myOriginPoint = your starting point (usually 0)

myLastOccurrence = your last point (usually with LastIndexOf)

Then you can use a loop and a temporary list, like:

List<string> urls = new List<string>();

while (myOriginPoint < myLastOccurrence)
{
    // retrieve the URL, searching forward from myOriginPoint
    var urlFound = your logic to retrieve the url

    // save the URL
    urls.Add(urlFound);

    // move to the next position: one past the index where this match was found
    myOriginPoint = indexOfThisMatch + 1;
}
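
To make the loop above concrete, here is a minimal sketch of one way to fill in the "your logic" part. It assumes every link you want looks exactly like `<a href="...">1</a>` (as in the question), so it searches for the marker `">1</a>` and walks back to the nearest `href="`. The names `LinkExtractor`, `ExtractLinks`, `marker` and `hrefStart` are made up for illustration, and the sample HTML in `Main` is only a stand-in for your real page source.

```csharp
using System;
using System.Collections.Generic;

static class LinkExtractor
{
    // Collects the href value of every anchor whose link text is exactly "1".
    // Assumption: each such anchor is written as <a href="...">1</a>.
    public static List<string> ExtractLinks(string source)
    {
        const string marker = "\">1</a>";   // closing quote of href, then the link text
        const string hrefStart = "href=\"";

        var urls = new List<string>();
        int myOriginPoint = 0;
        int myLastOccurrence = source.LastIndexOf(marker, StringComparison.Ordinal);

        // If the marker never occurs, myLastOccurrence is -1 and the loop is skipped.
        while (myLastOccurrence >= 0 && myOriginPoint <= myLastOccurrence)
        {
            // Next marker at or after the current position.
            int markerIndex = source.IndexOf(marker, myOriginPoint, StringComparison.Ordinal);

            // Walk backwards to the href=" that opens this attribute.
            int hrefIndex = source.LastIndexOf(hrefStart, markerIndex, StringComparison.Ordinal);
            if (hrefIndex >= 0)
            {
                int urlStart = hrefIndex + hrefStart.Length;
                urls.Add(source.Substring(urlStart, markerIndex - urlStart));
            }

            // Move past this match so the next search finds the following link.
            myOriginPoint = markerIndex + 1;
        }

        return urls;
    }

    static void Main()
    {
        // Stand-in for WebBrowser1.DocumentText from the question.
        string source =
            "<a href=\"showthread.php?tid=2632829\">1</a>\n" +
            "<script>/* unrelated */</script>\n" +
            "<a href=\"showthread.php?tid=2342818\">1</a>\n";

        // One URL per line.
        Console.WriteLine(string.Join(Environment.NewLine, ExtractLinks(source)));
    }
}
```

In a Windows Forms application, the joined string could be assigned to the `Text` property of your RichTextBox instead of being written to the console. If the real markup differs (extra attributes after `href`, single quotes, whitespace before the link text), the marker would need adjusting.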

By the way, you can also use WebClient in .NET; it is a much better way to retrieve data from a URL: http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx
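
As a rough sketch of the WebClient suggestion, the page source can be fetched without a WebBrowser control by calling `DownloadString`. The names `PageFetcher` and `FetchSource` are hypothetical, and the address you pass in is whatever page you were loading before.

```csharp
using System;
using System.Net;

static class PageFetcher
{
    // Downloads the resource at the given address and returns it as a string.
    // `address` is supplied by the caller; nothing here is specific to one site.
    public static string FetchSource(string address)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(address);
        }
    }
}
```

For example, `string source = PageFetcher.FetchSource(pageUrl);` (where `pageUrl` is your own variable) would take the place of reading `WebBrowser1.DocumentText`. Note that this returns the HTML as sent by the server, so anything the page adds later through scripts will not be in it, and that newer .NET versions mark WebClient as obsolete in favor of HttpClient.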

I hope it helps.

Upvotes: 1
