user1005974

Reputation: 33

Python programming with FTP and lists

My main goal is to check an FTP server at any time for new files and then generate a .txt file listing only the new files. If there are no new files, it returns nothing. Here is what I have so far. I start by copying the file listing from the server into oldlist.txt. Then I connect to the FTP site, compare the data in newlist.txt against oldlist.txt, and write the differences to Temporary FTP file Changes.txt. Each time I connect, I rename newlist.txt to oldlist.txt so that I can compare again on the next connection. Is there a better way to do this? My lists never seem to change between runs. Sorry if this is confusing, and thanks.

import os
filename = "oldlist.txt"
testing = "newlist.txt"
tempfilename = "Temporary FTP file Changes.txt"

old = open(filename, "r")
oldlist = old.readlines()
oldlist.sort()


from ftplib import FTP
ftp = FTP("ftpsite", "username", "password")
ftp.set_pasv(False)
newlist = []
ftp.dir(newlist.append)
newlist.sort()
ftp.close()

bob = open(testing, "w")
for nl in newlist:
    bob.write(nl + "\n")


hello = open(tempfilename, "w")

for c in newlist:
    # readlines() keeps the trailing newline on each line of oldlist,
    # but ftp.dir() lines do not, so normalize before comparing
    if c + "\n" not in oldlist:
        hello.write(c + "\n")

bob.close()
old.close()   
hello.close()

os.remove("oldlist.txt")

os.rename("newlist.txt", "oldlist.txt")

Upvotes: 3

Views: 970

Answers (2)

wberry

Reputation: 19347

Your implementation of this scheme is reasonable. I would not choose this scheme to implement automated FTP messaging, if that is what you're doing. There are two weaknesses of this approach:

  • It does not support filenames that repeat. Any filename that occurs in the "old" history will not be detected as a new file. Maybe this is a problem for you, maybe not. But even if filenames are guaranteed unique now, that may not always be true.
  • It does not tell you whether a new file is ready to be consumed or not. It is possible that a new file will be processed while it is still being uploaded. Some people apply a "no change in size for X seconds" rule, but that just increases delay and still leaves a vulnerability to severed connections.

One scheme that is similar but does not have either of these two problems is to actually store a file on the server with a reserved name, or in a separate place, and use its timestamp (preferably the modification time of the file itself) to decide which files can be safely processed. This "semaphore" file is updated to the current time as the last step in uploading a file. All files with a modification time older than the semaphore timestamp can be processed. Once processed, all files must be deleted out of the upload folder so they won't be processed twice. I have seen this scheme work well in an automated production data flow.
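The decision step of this semaphore scheme can be sketched in a few lines. This is only an illustration: the filenames and timestamps below are hypothetical, and fetching real modification times from the server (e.g. via FTP's MDTM command) is left aside.

```python
from datetime import datetime

def processable(file_mtimes, semaphore_mtime):
    """Return the names of files that are safe to process under the
    semaphore scheme: anything modified strictly before the semaphore
    file was last touched."""
    return [name for name, mtime in file_mtimes.items()
            if mtime < semaphore_mtime]

# Hypothetical upload-folder listing with server-reported modification times
listing = {
    "data_0101.csv": datetime(2012, 1, 1, 9, 0),
    "data_0102.csv": datetime(2012, 1, 2, 9, 0),
    "in_progress.csv": datetime(2012, 1, 2, 9, 31),   # still being uploaded
}
# The uploader touched the semaphore file after finishing data_0102.csv
semaphore = datetime(2012, 1, 2, 9, 30)

print(sorted(processable(listing, semaphore)))
# ['data_0101.csv', 'data_0102.csv']
```

After processing, the two completed files would be deleted from the upload folder, while `in_progress.csv` is left for a later pass once the semaphore has been updated again.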

Upvotes: 0

Raymond Hettinger

Reputation: 226346

It's a little easier/faster to convert the lists to sets and not worry about sorting.

for filename in set(newlist) - set(oldlist):
    print('New file:', filename)

Also, instead of saving the list to a file as raw text, you could use the shelve module to make a persistent store that is conveniently accessible like a regular Python dict.

Otherwise, your code has the virtues of being simple and straightforward.

Here's a worked out example:

from ftplib import FTP
import shelve

olddir = shelve.open('filelist.shl')   # create a persistent dictionary

ftp = FTP('ftp1.freebsd.org')
ftp.login()

result = []
ftp.dir(result.append)       # collect the directory listing line by line
ftp.quit()
newdir = set(result[1:])     # skip the "total ..." summary line

print(' New Files '.center(50, '='))
for line in sorted(newdir - set(olddir)):
    print(line)
    olddir[line] = ''
print(' Done '.center(50, '='))
olddir.close()

Upvotes: 3
