Bill Gammel

Reputation: 21

How to get wget to accept files with no suffix

I'm using wget (called from Perl) to fetch web pages from a site. I'm only interested in the html, htm, php, asp, and aspx file types. However, at least one site links to pages whose file names have no extension (suffix), and I need those too.

My current command:

wget -A html,htm,php,asp,aspx

works great, except for the links with no suffix.

I've tried several regex strings to match the no-suffix pages, but to no avail: wget returns just the main page. So far, the only way to get these files is to accept all file types (which isn't terrible for this website, but would be for others).

Is there a way, with a regex or an ordinary option, to tell wget that I also want links with no suffix?

Upvotes: 2

Views: 1880

Answers (1)

krisku

Reputation: 3993

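wget 1.14 and later support an --accept-regex argument, which is matched against the full URL, so something like the following should in theory work (untested):

wget --accept-regex '/[^.]+(\.(html?|php|aspx?))?$'

The groups are plain parentheses rather than (?:...) because wget's default regex type is POSIX, which has no non-capturing groups; the (?:...) form would need --regex-type pcre and a wget built with PCRE support.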
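Since the pattern is untested against wget itself, one way to sanity-check it offline is to run it through grep -E, which also uses POSIX extended regex syntax. This is only a sketch: the pattern is written with plain groups so that grep -E can evaluate it, the example.com URLs are made up, and classify_url is a hypothetical helper name.

```shell
#!/bin/sh
# Accept pattern with plain groups, valid as a POSIX extended regex.
ACCEPT_RE='/[^.]+(\.(html?|php|aspx?))?$'

# Hypothetical helper: prints "accept" if the URL matches the pattern,
# "skip" otherwise. It only tests the regex; it does not call wget.
classify_url() {
    if printf '%s\n' "$1" | grep -Eq "$ACCEPT_RE"; then
        echo accept
    else
        echo skip
    fi
}

# Made-up sample URLs covering the cases from the question.
for url in \
    'http://example.com/about' \
    'http://example.com/index.html' \
    'http://example.com/page.php' \
    'http://example.com/form.aspx' \
    'http://example.com/logo.png' \
    'http://example.com/style.css'
do
    printf '%-6s %s\n' "$(classify_url "$url")" "$url"
done
```

With these samples, the no-suffix URL and the html, php, and aspx URLs are accepted, while the png and css URLs are skipped. Note that a URL with a query string (for example page.php?id=1) would not match, because the pattern is anchored at the end of the URL.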

Or perhaps it would be easier to reject the extensions you do not want, using -R (--reject) or --reject-regex instead?
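A rough sketch of the reject-based approach follows. The list of unwanted extensions is an assumption and would need adjusting per site; would_reject and fetch_site are hypothetical helper names, and fetch_site is defined but deliberately not called, so the script only checks the pattern with grep -E against made-up URLs.

```shell
#!/bin/sh
# Assumed list of unwanted extensions; adjust for the site being mirrored.
REJECT_RE='\.(jpe?g|png|gif|css|js|pdf|zip)$'

# Hypothetical helper: prints "reject" if the URL matches the pattern,
# "keep" otherwise. It only tests the regex; it does not call wget.
would_reject() {
    if printf '%s\n' "$1" | grep -Eq "$REJECT_RE"; then
        echo reject
    else
        echo keep
    fi
}

# Hypothetical wrapper showing how the pattern would be passed to wget.
# Defined for illustration only; nothing below invokes it.
fetch_site() {
    wget -r --reject-regex "$REJECT_RE" "$1"
}

# Made-up sample URLs.
for url in \
    'http://example.com/about' \
    'http://example.com/index.html' \
    'http://example.com/logo.png' \
    'http://example.com/app.js'
do
    printf '%-6s %s\n' "$(would_reject "$url")" "$url"
done
```

The trade-off is that a reject list lets through every extension it does not name, so pages with no suffix pass automatically, but so does any unexpected file type.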

Upvotes: 1
