Reputation: 45

Parsing URLs from a txt file

I am trying to parse a txt file that looks like this:

Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

I need to read the file and extract the URL path that follows 'Disallow:' on each line, ignoring the trailing comments. Thanks in advance.

Upvotes: 0

Answers (2)

Reputation: 40973

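If you are trying to parse a robots.txt file, use the robotparser module instead of parsing it by hand. In Python 3 it lives in urllib.robotparser (Python 2 exposed it as a top-level robotparser module):

>>> import urllib.robotparser

>>> r = urllib.robotparser.RobotFileParser()
>>> r.set_url("http://www.your_url.com/robots.txt")
>>> r.read()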

Then just check:

>>> r.can_fetch("*", "/foo.html")
False
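If the rules are already in a local file or in memory, the parser can be fed lines directly instead of a URL. The following is a minimal sketch, assuming Python 3's urllib.robotparser; the sample lines come from the question, and the "User-agent: *" line is an assumption added here because Disallow rules only take effect inside a user-agent group.

```python
from urllib.robotparser import RobotFileParser

# Rules from the question. The User-agent line is an assumption: without
# a user-agent group, the parser ignores the Disallow rules that follow.
lines = [
    "User-agent: *",
    "Disallow: /cyberworld/map/ # This is an infinite virtual URL space",
    "Disallow: /tmp/ # these will soon disappear",
    "Disallow: /foo.html",
]

parser = RobotFileParser()
parser.parse(lines)  # parse in-memory lines; no network request is made

foo_ok = parser.can_fetch("*", "/foo.html")      # covered by a Disallow rule
tmp_ok = parser.can_fetch("*", "/tmp/cache")     # under the /tmp/ prefix
index_ok = parser.can_fetch("*", "/index.html")  # no rule matches
print(foo_ok, tmp_ok, index_ok)
```

Note that this answers "may I fetch this URL?" and does not hand back the list of Disallow paths itself.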

Upvotes: 5

Reputation: 113975

Assuming that there's no # in the URLs:

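with open('path/to/file') as infile:
    # lstrip("Disallow:") strips a set of characters, not a prefix, so split on the first ':' instead.
    URLs = [line.split(":", 1)[1].split("#", 1)[0].strip()
            for line in infile if line.startswith("Disallow:")]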

Allowing for # inside the URLs, but assuming that a space separates each URL from its # comment:

with open('path/to/file') as infile:
    # Split on " #" so that a '#' inside the URL itself is preserved.
    URLs = [line.split(":", 1)[1].split(" #", 1)[0].strip()
            for line in infile if line.startswith("Disallow:")]
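The same idea can be wrapped in a small helper that takes any iterable of lines. This is only a sketch: the name disallowed_paths is hypothetical, and it assumes a '#' starts a comment only at the beginning of the value or after whitespace, which is the second assumption above.

```python
import re

def disallowed_paths(lines):
    """Return the paths of all Disallow rules, with comments removed."""
    paths = []
    for line in lines:
        # Split "Key: value" on the first colon only.
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "disallow":
            continue  # not a Disallow rule (blank line, comment, other key)
        # Cut the comment: a '#' at the start of the value or after whitespace.
        value = re.split(r"(?:^|\s)#", value, maxsplit=1)[0].strip()
        if value:
            paths.append(value)
    return paths

sample = [
    "Disallow: /cyberworld/map/ # This is an infinite virtual URL space",
    "Disallow: /tmp/ # these will soon disappear",
    "Disallow: /foo.html",
]
print(disallowed_paths(sample))
```

Passing an open file object works as well, since iterating over a file yields its lines.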

Upvotes: 1