user2184697

How can I get a database of valid URLs for my search engine?

I'm trying to build an Internet search engine for a school project, using only C# and the .NET Framework. I need to download the HTML code of the pages I'm indexing.

Now all I need is a list of valid URLs.

Since I don't have a database of valid URLs, I wrote a trial-and-error algorithm that grows a string:

a, b, c.....
aa, ab, ac......
aaa, aab, aac......
aaaa, aaab, aaac......
aaaaa, aaaab, aaaac......

and then tries appending .com, .net, and so on. This is far too inefficient.
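To see why this enumeration explodes, here is a quick sketch (Python for brevity; `candidate_hosts` is a made-up name, and only the `.com` suffix is shown):

```python
import itertools
import string

def candidate_hosts(max_len):
    """Enumerate every lowercase name up to max_len characters,
    the way the trial-and-error approach does."""
    for length in range(1, max_len + 1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(letters) + ".com"

# There are 26**n candidates of length n, so the count blows up fast:
# lengths 1..3 alone give 26 + 676 + 17576 = 18278 names to try,
# and length 5 alone adds 26**5 = 11,881,376 more.
print(sum(1 for _ in candidate_hosts(3)))
```

Almost none of these names resolve, so nearly all of that work is wasted before a single page is downloaded.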

I need a database with valid URLs. Do you know where I can get one?

I can't work out how to get them straight out of DNS; is that even possible?

Upvotes: 2

Views: 328

Answers (1)

Tass

Reputation: 1248

You can build your own. Most search engines crawl pages and follow links to other pages.

You start with a known list (it doesn't have to be very big) then:

  1. Access a page in your list
  2. Find the links on that page
  3. Add those links to your list
  4. Go to 1
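The loop above can be sketched like this (Python for brevity; the same structure maps directly to C# with `HttpClient`). The regex-based link extraction and the `fetch` callable are simplifications for illustration, not a production HTML parser:

```python
import re
from collections import deque
from urllib.parse import urljoin

# Naive href matcher; a real crawler should use a proper HTML parser.
HREF_RE = re.compile(r'href=[\'"]?(http[^\'" >]+)', re.IGNORECASE)

def extract_links(base_url, html):
    """Step 2: find the links on a fetched page."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]

def crawl(seed_urls, fetch, limit=100):
    """Steps 1-4: breadth-first crawl from a small seed list.
    `fetch` is any callable that returns a page's HTML for a URL."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    while frontier and len(seen) < limit:
        url = frontier.popleft()                     # 1. access a page in the list
        for link in extract_links(url, fetch(url)):  # 2. find the links on it
            if link not in seen:                     # 3. add new links to the list
                seen.add(link)
                frontier.append(link)                # 4. go back to step 1
    return seen
```

In a real crawler, `fetch` would be an HTTP client (`HttpClient.GetStringAsync` in the asker's C# setting), and you would also want to respect robots.txt and rate-limit your requests.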

As for DNS: it isn't designed to be queried for URLs, only for hostnames. And, as far as I know, you can't get a list of every hostname from a DNS server unless you manage the server yourself (zone transfers are normally restricted).

Upvotes: 2
