Reputation: 2902
I'm developing a Python script that monitors a directory (using libinotify) for new files; for each new file it does some processing and then copies it to a storage server. We were using an NFS mount but had some performance issues, and now we are testing with FTP. It looks like FTP uses far fewer resources than NFS (the load stays under 2, whereas with NFS it was above 5).
The problem we are having now is the number of connections that stay open in TIME_WAIT state. The storage server has peaks of about 15k connections in TIME_WAIT.
I was wondering if there is some way to reuse a previous connection for new transfers.
Does anyone know if there is some way of doing that?
Thanks
Upvotes: 1
Views: 3260
Reputation: 365657
Here's a new answer, based on the comments to the previous one.
We'll use a single TCP socket, and send each file's name and contents as a pair of netstrings, all in one big stream.
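For concreteness, a netstring is just the payload's length in decimal, a colon, the payload, and a trailing comma, so one (name, contents) pair looks like this on the wire (values purely illustrative):
filename, contents = 'foo.txt', 'hello\x00world'   # binary-safe payload
frame = '{0}:{1},{2}:{3},'.format(len(filename), filename,
                                  len(contents), contents)
# frame == '7:foo.txt,11:hello\x00world,'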
I'm assuming Python 2.6, that the filesystems on both sides use the same encoding, and that you don't need lots of concurrent clients (but you might occasionally need, say, two: the real one and a tester). And I'm again assuming you've got a module filegenerator whose generate() method registers with inotify, queues up notifications, and yields them one by one.
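Since filegenerator is assumed rather than shown, here's a minimal sketch of what such a module could look like, using the third-party pyinotify package (the package choice, event mask, and watch directory are my assumptions; any inotify binding with a callback API would do):
filegenerator.py (sketch):
import Queue
import threading
import pyinotify

WATCH_DIR = '/path/to/read/from/'  # placeholder: the directory to monitor

_queue = Queue.Queue()

class _Handler(pyinotify.ProcessEvent):
    # IN_CLOSE_WRITE fires once a new file has been closed for writing,
    # so we never pick up half-written files.
    def process_IN_CLOSE_WRITE(self, event):
        _queue.put(event.pathname)

def _watch():
    wm = pyinotify.WatchManager()
    wm.add_watch(WATCH_DIR, pyinotify.IN_CLOSE_WRITE)
    pyinotify.Notifier(wm, _Handler()).loop()

def generate():
    # Run the watcher in the background and yield filenames as they arrive.
    t = threading.Thread(target=_watch)
    t.daemon = True
    t.start()
    while True:
        yield _queue.get()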
client.py:
import contextlib
import socket
import filegenerator
HOST = 'storage.example.com'  # placeholder: your storage server's address

sock = socket.socket()
with contextlib.closing(sock):
    sock.connect((HOST, 12345))
    for filename in filegenerator.generate():
        with open(filename, 'rb') as f:
            contents = f.read()
        # Frame the name and the contents as two netstrings and send both.
        buf = '{0}:{1},{2}:{3},'.format(len(filename), filename,
                                        len(contents), contents)
        sock.sendall(buf)
server.py:
import contextlib
import socket
import threading
import itertools

def pairs(iterable):
    # (a, b, c, d, ...) -> ((a, b), (c, d), ...): filename/contents pairs.
    # izip (not zip) so pairs stream out as they arrive, instead of
    # buffering everything until the connection closes.
    return itertools.izip(*[iter(iterable)]*2)

def netstrings(conn):
    # Parse a stream of netstrings off the socket, yielding each payload.
    buf = ''
    while True:
        newbuf = conn.recv(1536*1024)
        if not newbuf:
            return
        buf += newbuf
        while True:
            colon = buf.find(':')
            if colon == -1:
                break
            length = int(buf[:colon])
            if len(buf) >= colon + length + 2:
                if buf[colon+length+1] != ',':
                    raise ValueError('Not a netstring')
                yield buf[colon+1:colon+length+1]
                buf = buf[colon+length+2:]
            else:
                # Not enough data for a complete netstring yet; read more.
                break

def client(conn):
    with contextlib.closing(conn):
        for filename, contents in pairs(netstrings(conn)):
            with open(filename, 'wb') as f:
                f.write(contents)

sock = socket.socket()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
with contextlib.closing(sock):
    sock.bind(('0.0.0.0', 12345))
    sock.listen(1)
    while True:
        conn, addr = sock.accept()
        t = threading.Thread(target=client, args=[conn])
        t.daemon = True
        t.start()
If you need more than about 200 clients on Windows, 100 on Linux and BSD (including Mac), or a dozen on less capable platforms, you probably want to go with an event-loop design instead of a threaded design, using epoll on Linux, kqueue on BSD, and IO completion ports on Windows. This can be painful, but fortunately there are frameworks that wrap everything up for you. Two popular (and very different) choices are Twisted and gevent.
One nice thing about gevent in particular is that you can write threaded code today and, with a handful of simple changes, turn it into event-based code like magic, as the sketch below shows.
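For example, here's what those changes can look like for the server above (a sketch, assuming the third-party gevent package is installed): monkey-patch the standard library before anything else, and the threaded code runs on lightweight greenlets unchanged.
from gevent import monkey
monkey.patch_all()  # must run before socket/threading are imported

import contextlib
import socket
import threading
# ... the rest of server.py continues exactly as written above ...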
On the other hand, if you're eventually going to want event-based code, it's probably better to learn and use a framework from the start, so you don't have to deal with all the fiddly bits of accepting connections, looping around recv until you get a full message, shutting down cleanly, and so on; you can just write the parts you care about. After all, more than half the code above is basically boilerplate that every server shares, so if you don't have to write it, why bother?
In a comment, you said:
Also the files are binary, so it's possible that I'll have problems if client encodings are different from server's.
Notice that I opened each file in binary mode ('rb' and 'wb'), and intentionally chose a protocol (netstrings) that can handle binary strings without trying to interpret them as characters or treat embedded NUL characters as EOF or anything like that. And while I'm using str.format, in Python 2.x that won't do any implicit encoding unless you feed it unicode strings or give it locale-based format types, neither of which I'm doing. (Note that in 3.x, you'd need to use bytes instead of str, which would change a bit of the code.)
In other words, the client and server encodings don't enter into it; you're doing a binary transfer exactly the same as FTP's I mode.
But what if you wanted the opposite, to transfer text and re-encode automatically for the target system? There are three easy ways to do that.
Going with the third option, assuming that the files are going to be in your default filesystem encoding, the changed client code is:
# (this assumes "import io" and "import sys" at the top of client.py)
with io.open(filename, 'r', encoding=sys.getfilesystemencoding()) as f:
    contents = f.read().encode('utf-8')
And on the server:
# (likewise, "import io" and "import sys" at the top of server.py)
with io.open(filename, 'w', encoding=sys.getfilesystemencoding()) as f:
    f.write(contents.decode('utf-8'))
The io.open function also uses universal newlines by default, so the client will translate anything into Unix-style newlines, and the server will translate to its own native newline type.
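For example (illustrative only; 'demo.txt' is a throwaway name):
import io

# Writing: newline=None (the default) turns '\n' into os.linesep on disk.
with io.open('demo.txt', 'w') as f:
    f.write(u'line1\nline2\n')   # io text mode wants unicode in 2.x

# Reading: '\r\n' and '\r' both come back as '\n', on any platform.
with io.open('demo.txt', 'r') as f:
    assert f.read() == u'line1\nline2\n'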
Note that FTP's T mode actually doesn't do any re-encoding; it only does newline conversion (and a more limited version of it).
Upvotes: 1
Reputation: 365657
Yes, you can reuse connections with ftplib. All you have to do is not close them and keep using them.
For example, assuming you've got a module filegenerator whose generate() method registers with inotify, queues up notifications, and yields them one by one:
import ftplib
import os
import filegenerator
ftp = ftplib.FTP('ftp.example.com')
ftp.login()
ftp.cwd('/path/to/store/stuff')
os.chdir('/path/to/read/from/')
for filename in filegenerator.generate():
    with open(filename, 'rb') as f:
        ftp.storbinary('STOR {0}'.format(filename), f)
ftp.close()
I'm a bit confused by this:
The problem we are having now is the number of connections that stay open in TIME_WAIT state.
It sounds like your problem is not that you create a new connection for each file, but that you never close the old ones. In which case the solution is easy: just close them.
Either that, or you're trying to do them all in parallel, but don't realize that's what you're doing.
If you want some parallelism, but not unboundedly so, you can easily create, say, a pool of 4 threads, each with its own open ftplib connection, each reading from a queue, plus an inotify thread that just pushes filenames onto that queue, as sketched below.
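Here's a minimal sketch of that layout (a sketch, not a drop-in: the host, paths, and pool size are placeholders, and filegenerator is the same assumed module as above):
import ftplib
import os
import Queue
import threading

import filegenerator

NUM_WORKERS = 4  # bounded parallelism: one FTP connection per worker
queue = Queue.Queue()

def worker():
    # Each worker logs in once and reuses its connection for every
    # file it pulls off the queue.
    ftp = ftplib.FTP('ftp.example.com')
    ftp.login()
    ftp.cwd('/path/to/store/stuff')
    while True:
        filename = queue.get()
        with open(filename, 'rb') as f:
            ftp.storbinary('STOR {0}'.format(filename), f)
        queue.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

# The inotify side just feeds the queue.
os.chdir('/path/to/read/from/')
for filename in filegenerator.generate():
    queue.put(filename)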
Upvotes: 0