StevenWang

Reputation: 3816

HTML parser: my recent project needs a web spider

My recent project needs a web spider that automatically fetches web content and follows the links it finds recursively. It also needs to understand the content precisely, for example which tags it contains. It has to run on both Linux and Windows. Do you know of any open-source software that meets these needs, or do you have any suggestions? Thanks.

Upvotes: 0

Views: 483

Answers (3)

David Claridge

Reputation: 6329

It depends on which language you are developing in. Try googling:

html parser languagename

hpricot is a good one for Ruby, for example.

Upvotes: 0

Chris Lutz

Reputation: 75389

Here is a StackOverflow question showing how to use a number of XML/HTML parsers in different languages. If you tell us what language you're using, I can be more specific, but your answer may already be in there.

Upvotes: 3

xrath

Reputation: 854

I think the subject you need to learn is regular expressions.

Regular expressions are available on virtually every platform and in most mainstream languages (Java, PHP, Python, C#, Ruby, JavaScript). Using a regular expression, you can easily extract the content in whatever form you want.

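import java.util.regex.Matcher;
import java.util.regex.Pattern;

// pageContent is a String holding the HTML of the downloaded page.
Pattern p = Pattern.compile("<a\\s[^>]*href=\"([^\"]+?)\"[^>]*>");
Matcher m = p.matcher(pageContent);
while (m.find()) {
  System.out.println(m.group(1)); // group 1 is the href value
}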

The code above, written in Java, finds every anchor tag in a page that has a double-quoted href attribute and prints its URL.
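For anyone who wants to try this out directly, here is a minimal self-contained sketch of the same idea. The class name LinkExtractor, the method extractLinks, and the sample HTML are made up for illustration and are not part of any library. The pattern is a slightly loosened variant, not the original one: it also accepts single-quoted attributes and uppercase tag names. It is still only a pattern match, not a real HTML parser, so it can be fooled by things like commented-out markup or unquoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LinkExtractor {
    // Matches an opening <a ...> tag with a quoted href attribute.
    // Group 1 is the quote character, group 2 is the URL between the quotes.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*([\"'])(.*?)\\1[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values of all matching anchor tags, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample input; a spider would pass in the downloaded page instead.
        String sample = "<p><a href=\"http://example.com/a\">A</a> "
                + "<A class='x' HREF='/b'>B</A></p>";
        for (String link : extractLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

A spider would feed each downloaded page into extractLinks, resolve relative URLs such as /b against the page's address, and queue any links it has not visited yet.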

If you don't have enough time to learn regular expressions, the following library will help you.

http://htmlparser.sourceforge.net/

Upvotes: -1
