Reputation: 111
I am writing a program in C#. I need to know whether there is a way to open the URL of a site and look for keywords in its text. For example, if my program gets the URL http://www.google.com and the keyword "gmail", it should return true. In short, I need a way to go to a URL, download the HTML file, and convert it to text so I can search it for my keyword.
Upvotes: 3
Views: 2657
Reputation: 831
using (WebClient client = new WebClient())
{
client.DownloadFile("http://example.com", @"D:\filename.txt");
}
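If the goal is only to check for a keyword, saving the page to disk may not be needed. Below is a minimal sketch, assuming two hypothetical helpers (ContainsKeyword and PageContainsKeyword, neither from the original answer), that downloads the page into a string with WebClient.DownloadString and does a case-insensitive search. Note that it searches the raw HTML, so a keyword that appears only inside a tag or attribute also counts as a match. Newer .NET versions mark WebClient as obsolete in favor of HttpClient.

```csharp
using System;
using System.Net;

public static class KeywordSearch
{
    // Hypothetical helper: case-insensitive substring check on text
    // that has already been downloaded.
    public static bool ContainsKeyword(string text, string keyword)
    {
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Hypothetical helper: downloads the page at url into memory
    // (no file on disk) and searches the raw HTML for the keyword.
    public static bool PageContainsKeyword(string url, string keyword)
    {
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(url);
            return ContainsKeyword(html, keyword);
        }
    }
}
```

Usage would look like KeywordSearch.PageContainsKeyword("http://www.google.com", "gmail").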
Upvotes: 0
Reputation: 7279
In Visual Basic this works:
Imports System
Imports System.IO
Imports System.Net
Function MakeRequest(ByVal url As String) As String
    Dim request As WebRequest = WebRequest.Create(url)
    ' If required by the server, set the credentials. '
    request.Credentials = CredentialCache.DefaultCredentials
    ' Get the response. Using closes it even if reading fails. '
    Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        ' Open the response stream with a StreamReader for easy access. '
        Using reader As New StreamReader(response.GetResponseStream())
            ' Read the whole page and return it as one string. '
            Return reader.ReadToEnd()
        End Using
    End Using
End Function
Edit: For future reference for others who find this page: you pass in a URL, and this function goes to the page, reads all of the HTML, and returns it as a string. Then all you have to do is parse it (search it for your text), or you could use a StreamWriter to save it to a text or HTML file if you wanted to.
Upvotes: 1
Reputation: 62248
Do not use regular expressions for parsing HTML; HTML is too complex for regular expressions. Check out the long discussion on SO about this:
RegEx match open tags except XHTML self-contained tags
Instead, use an existing HTML parser for this purpose.
Here is another discussion on SO where you can find the links you need
You can also search the internet yourself.
Upvotes: 0
Reputation: 6108
You should be able to open the HTML file as-is. HTML files are plaintext, meaning that FileStream and StreamReader should be sufficient to read the file.
If you really want the file to be a .txt, you can simply save the file as filename.txt instead of filename.html when you download it.
Upvotes: 1
Reputation: 3815
It sounds like you want to remove all the HTML tags and then search the resulting text.
My first reaction was to use a Regular Expression:
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
Shamelessly stolen from: Using C# regular expressions to remove HTML tags
That question suggests the HTML Agility Pack, which sounds like exactly what you're looking for.
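To tie the tag-stripping step to the keyword search, here is a rough sketch using the same regular expression. The names StripTags and TextContains are hypothetical, and two details differ from the one-liner above: tags are replaced with a space instead of String.Empty so that words on either side of a tag do not run together, and WebUtility.HtmlDecode is applied so entities such as &amp;amp; become plain characters. It is still a crude approach: the contents of script and style elements are left in the text, and malformed markup can confuse the pattern, which is where a real parser like the HTML Agility Pack would do better.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSearch
{
    // Hypothetical helper: removes anything that looks like a tag,
    // then decodes HTML entities in what remains.
    public static string StripTags(string html)
    {
        // A space (not String.Empty) keeps neighbouring words apart.
        string withoutTags = Regex.Replace(html, @"<[^>]*>", " ");
        return WebUtility.HtmlDecode(withoutTags);
    }

    // Hypothetical helper: case-insensitive keyword search in the
    // stripped text, so tag names and attribute values are ignored.
    public static bool TextContains(string html, string keyword)
    {
        return StripTags(html).IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With this, a keyword that appears only in an href or other attribute no longer counts as a match, while one in the text between tags does. One trade-off of inserting spaces is that a word split by inline tags (for example a single bolded letter) will no longer match as one word.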
Upvotes: 2