Reputation: 111

Download HTML file and convert it to TXT

I am writing a program in C#. I need to know whether there is a way to open a URL and look for keywords in the page text. For example, if my program gets the URL http://www.google.com and the keyword "gmail", it should return true. In short, I need a way to go to a URL, download the HTML file, and convert it to text so that I can search for my keyword.

Upvotes: 3

Answers (5)

Gokul

Reputation: 831

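// Requires: using System.Net;
using (WebClient client = new WebClient())
{
    // Saves the page's HTML to disk; the content is plain text whatever the extension.
    client.DownloadFile("http://example.com", @"D:\filename.txt");
}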

Upvotes: 0

AndyD273

Reputation: 7279

In Visual Basic, this works:

Imports System
Imports System.IO
Imports System.Net

Function MakeRequest(ByVal url As String) As String
    Dim request As WebRequest = WebRequest.Create(url)
    ' If required by the server, set the credentials.
    request.Credentials = CredentialCache.DefaultCredentials
    ' Get the response; the Using blocks dispose each resource on exit.
    Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        ' Get the stream containing content returned by the server.
        Using dataStream As Stream = response.GetResponseStream()
            ' Open the stream using a StreamReader for easy access.
            Using reader As New StreamReader(dataStream)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Using
End Function

Edit: For future reference for others who find this page: you pass in a URL, and this function fetches the page, reads all of its HTML, and returns it as a string. Then all you have to do is search that string for your text, or you could use a StreamWriter to save it to a text or HTML file if you wanted to.
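
Since the question asks about C#, roughly the same idea might look like the sketch below. It assumes WebClient.DownloadString is acceptable for the fetch. The names PageSearch, ContainsKeyword, and PageContainsKeyword are made up for illustration, and the search runs against the raw HTML, so a keyword that appears only in markup or attributes also counts as a match.

```csharp
using System;
using System.Net;

public static class PageSearch
{
    // Case-insensitive search of text that has already been downloaded.
    public static bool ContainsKeyword(string text, string keyword)
    {
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Downloads the page at url as a string, then searches the raw HTML.
    public static bool PageContainsKeyword(string url, string keyword)
    {
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(url);
            return ContainsKeyword(html, keyword);
        }
    }
}
```

A call would then look something like: bool found = PageSearch.PageContainsKeyword("http://www.google.com", "gmail");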

Upvotes: 1

Tigran

Reputation: 62248

Do not use regular expressions to parse HTML; HTML is too complex for regular expressions. See this long discussion on SO:

RegEx match open tags except XHTML self-contained tags

Instead, use an existing HTML parser for this purpose.

Here is another discussion on SO where you can find the links you need:

Looking for C# HTML parser

You can also search the internet yourself.

Upvotes: 0

asfallows

Reputation: 6108

You should be able to open the HTML file as-is. HTML files are plaintext, meaning that FileStream and StreamReader should be sufficient to read the file.

If you really want the file to be a .txt, you can simply save the file as filename.txt instead of filename.html when you download it.
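
As a rough illustration of that point, the sketch below reads a saved page with StreamReader and checks it for a keyword. The class name, method name, and any path passed in are hypothetical; it assumes the file was already downloaded and that the default encoding detection is good enough.

```csharp
using System;
using System.IO;

public static class SavedPageSearch
{
    // Reads the whole file as text and does a case-insensitive search.
    public static bool FileContainsKeyword(string path, string keyword)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string contents = reader.ReadToEnd();
            return contents.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
        }
    }
}
```

The extension of the file does not matter here: filename.html and filename.txt are read the same way.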

Upvotes: 1

userx

Reputation: 3815

It sounds like you want to remove all the HTML tags and then search the resulting text.

My first reaction was to use a Regular Expression:

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

I shamelessly stole this from: Using C# regular expressions to remove HTML tags

The answers there suggest the HTML Agility Pack, which sounds like exactly what you're looking for.
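
If the regular expression route is good enough for your pages, a minimal sketch of "strip the tags, then search" could look like this. The HtmlText class and its method names are invented for illustration. It replaces each tag with a space rather than String.Empty so that adjacent words do not run together, and it decodes entities with WebUtility.HtmlDecode. It does not remove the contents of script or style elements, and it can be confused by a ">" inside an attribute value, so treat it as an approximation rather than a real parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Replaces anything that looks like a tag with a space,
    // then decodes entities such as &amp;.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]*>", " ");
        return WebUtility.HtmlDecode(withoutTags);
    }

    // Case-insensitive keyword search over the tag-stripped text.
    public static bool TextContains(string html, string keyword)
    {
        return StripTags(html).IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Unlike a search over the raw HTML, this ignores keywords that occur only inside tags, such as in an href attribute.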

Upvotes: 2
