aeupinhere

Reputation: 2983

Getting a list of files on a web server

All,

I would like to get a list of files from a server, with the full URL to each file intact. For example, I would like to get all the TIFFs from here:

http://hyperquad.telascience.org/naipsource/Texas/20100801/*

I can download all the .tif files with wget, but what I am looking for is just the full URL to each file, like this:

http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_2_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_3_20100424.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_4_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_1_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_2_20100430.tif

Any thoughts on how to get all these files into a list using something like curl or wget?

Adam

Upvotes: 6

Views: 70040

Answers (5)

OldManSam

Reputation: 39

I have a client-server system that retrieves the file names from an assigned folder on the app server, then displays thumbnails in the client. On the client side, slThumbnailNames is a TStringList. On the server side, a TIdCmdTCPServer has a command handler, GetThumbnailNames (a command handler is a procedure).

Hints: sMFFBServerPictures is set in the OnCreate method of the app server. sThumbnailDir is passed to the app server from the client.

// Client side: ask the server for the thumbnail names.
slThumbnailNames := funGetThumbnailNames(sThumbnailPath);

function TfMFFBClient.funGetThumbnailNames(sThumbnailPath: string): TStringList;
var
  slThisStringList: TStringList;
begin
  slThisStringList := TStringList.Create;
  // Send the command; 700 is the expected response code.
  dmMFFBClient.tcpMFFBClient.SendCmd('GetThumbnailNames,' + sThumbnailPath, 700);
  // Capture the multi-line response into the string list.
  dmMFFBClient.tcpMFFBClient.IOHandler.Capture(slThisStringList);
  Result := slThisStringList;
end;

// Server side: the GetThumbnailNames command handler.
procedure TfMFFBServer.MFFBCmdTCPServercmdGetThumbnailNames(
  ASender: TIdCommand);
var
  SRec: TSearchRec;
  sThumbnailDir: string;
  iResult: Integer;
begin
  try
    ASender.Response.Clear;
    sThumbnailDir := ASender.Params[0];
    // FindFirst/FindNext return 0 while there are matching entries.
    iResult := FindFirst(sMFFBServerPictures + sThumbnailDir + '*_t.jpg', faAnyFile, SRec);
    if iResult = 0 then
    try
      while iResult = 0 do
      begin
        // Skip directories; add each matching file name to the response.
        if (SRec.Attr and faDirectory) <> faDirectory then
          ASender.Response.Add(SRec.Name);
        iResult := FindNext(SRec);
      end;
    finally
      FindClose(SRec);
    end
    else
      ASender.Response.Add('NO THUMBNAILS');
  except
    on e: Exception do
      MessageDlg('Error in procedure TfMFFBServer.MFFBCmdTCPServercmdGetThumbnailNames' + #13 +
        'Error msg: ' + e.Message, mtError, [mbOK], 0);
  end;
end;

Upvotes: 0

Javier Reinoso

Reputation: 11

WinSCP has a Find window that can search for all files in the directories and subdirectories under a given directory on the web host. Afterwards you can select all the results and copy them, which gives you the links to all the files as text. You need a username and password to connect over FTP:

https://winscp.net/eng/download.php
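
If you prefer to script it, WinSCP also ships a console binary, winscp.com, whose script commands can produce a similar listing. A sketch, assuming the server allows FTP access; the username, password, and path here are placeholders, and ls prints bare file names rather than full URLs:

winscp.com /command "open ftp://username:password@hyperquad.telascience.org/" "ls /naipsource/Texas/20100801/" "exit"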

Upvotes: 1

Michael Aaron Safyan

Reputation: 95459

If you wget http://hyperquad.telascience.org/naipsource/Texas/20100801/, the HTML that is returned contains the list of files. If you don't need this to be general, you could use regexes to extract the links. If you need something more robust, you can use an HTML parser (e.g. BeautifulSoup), and programmatically extract the links on the page (from the actual HTML structure).
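
A minimal sketch of the BeautifulSoup route (Python 3 with the beautifulsoup4 package installed is assumed; the directory URL and the .tif filter come from the question):

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # pip install beautifulsoup4

base = 'http://hyperquad.telascience.org/naipsource/Texas/20100801/'
html = urllib.request.urlopen(base).read()

# Walk the anchor tags in the directory index, resolve each href
# against the directory URL, and keep only the TIFFs.
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    url = urljoin(base, a['href'])
    if url.endswith('.tif'):
        print(url)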

Upvotes: 1

Greg Dubicki

Reputation: 6930

I would use the lynx shell web browser to get the list of links, plus the grep and awk shell tools to filter the results, like this:

lynx -dump -listonly <URL> | grep http | grep <regexp> | awk '{print $2}'

..where:

  • URL - is the start URL, in your case: http://hyperquad.telascience.org/naipsource/Texas/20100801/
  • regexp - is the regular expression that selects only files that interest you, in your case: \.tif$


Complete example commandline to get links to TIF files on this SO page:

lynx -dump -listonly http://stackoverflow.com/questions/6989681/getting-a-list-of-files-on-a-web-server | grep http | grep '\.tif$' | awk '{print $2}'

..now returns:

http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_2_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_4_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_2_20100430.tif

Upvotes: 4

Richard Corfield

Reputation: 717

You'd need the server to be willing to give you a page with a listing on it. That would normally be an index.html, or you can simply request the directory itself:

http://hyperquad.telascience.org/naipsource/Texas/20100801/

It looks like you're in luck in this case, so, at the risk of upsetting the webmaster, the solution would be to use wget's recursive option. Specify a maximum recursion depth of 1 to keep it constrained to that single directory, as in the sketch below.
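
A hedged example of that invocation, assuming GNU wget: --spider checks each link without saving the files, -A '*.tif' keeps only the TIFFs, and the full URLs can then be scraped from wget's log output:

wget -r -l 1 --no-parent --spider -A '*.tif' http://hyperquad.telascience.org/naipsource/Texas/20100801/ 2>&1 | grep -o 'http://.*\.tif' | sort -u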

Upvotes: 5
