Proper way to encapsulate socket data in python?

Question

I'm working on an application that sends and receives data to another instance of itself via sockets, and I'm curious as to the most efficient way to encapsulate the data with an "END" tag. For instance, here are two functions used to read and write across the socket connection:

def sockWrite(conn, data):
    data = data + ":::END"
    conn.write(data)

def sockRead(conn):
    data = ""
    recvdata = conn.read()
    while recvdata:
        data = data + recvdata
        if data.endswith(':::END'):
            data = data[:len(data)-6]
            break
        recvdata = conn.read()
    if data == "":
        print('SOCKR: No data')
    else:
        print('SOCKR: %s' % data)
    return data

I'm basically tacking an ":::END" onto the write, because multiple reads could occur for this single write. Thus, the read loops until it hits the ":::END".

This of course causes a problem if the data variable contains the string ":::END" which happens to come at the end of one of the reads.

Is there a proper way to encapsulate the data with as minimum of bandwidth addition as possible? I had thought about pickle or json, but worried that will add a significant amount of bandwidth since I believe they will convert the binary data to ASCII. Am I correct with that?

Thanks, Ben

abarnert · Accepted Answer

Zeroth: Do you really need to optimize this?

Usually you send relatively small messages. Shaving 60 bytes off a 512-byte message is usually silly when you look at how much ethernet, IP, and TCP overhead you're ignoring, and the RTT that swamps the bandwidth.

On the other hand, when you are sending huge messages, there's often no need to send multiple messages on the same connection.

Look at common internet protocols like HTTP, IMAP, etc. Most of them use line-delimited, human-readable, easily-debuggable plain text. HTTP can send "the rest of the message" in binary, but then you close the socket after you finish sending.

99% of the time, this is good enough. If you don't think it's good enough in your case, I'd still write the text version of your protocol, and then add an optional binary version once you've got everything debugged and working (and then test to see whether it really makes a difference).

Meanwhile, there are two problems with your code.

First, as you recognize, if you're using ":::END" as a delimiter, and your messages can include that string in their data, you have an ambiguity. The usual way to solve this problem is some form of escaping or quoting. For a really simple example:

def sockWrite(conn, data):
    data = data.replace(':', r'\:') + ":::END"
    conn.write(data)

Now on the read side, you just pull off the delimiter, and then call replace(r'\:', ':') on the message. A robust version also has to escape the backslash itself; otherwise a message that already contains '\:' won't survive the round trip. (Of course it's wasteful to escape every colon just to use a 6-byte ':::END' delimiter; you might as well just use an unescaped colon as a delimiter, or write a more complex escaping mechanism.)
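Here is a minimal sketch of a round-trip-safe version of that escaping idea, assuming Python 3 and bytes payloads. The names escape, unescape, frame, and unframe are made up for illustration; the only change from the one-liner above is that the backslash is escaped as well.

```python
DELIM = b":::END"

def escape(payload: bytes) -> bytes:
    # Escape the escape character first, then the colon, so the two
    # substitutions stay distinguishable when reversed.
    return payload.replace(b"\\", b"\\\\").replace(b":", b"\\:")

def unescape(escaped: bytes) -> bytes:
    # Walk the data once: a backslash means "take the next byte literally".
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i:i + 1] == b"\\" and i + 1 < len(escaped):
            out += escaped[i + 1:i + 2]
            i += 2
        else:
            out += escaped[i:i + 1]
            i += 1
    return bytes(out)

def frame(payload: bytes) -> bytes:
    # Every colon in the escaped body is preceded by a backslash, so the
    # sequence ":::END" can only appear where the delimiter is appended.
    return escape(payload) + DELIM

def unframe(framed: bytes) -> bytes:
    body, sep, _rest = framed.partition(DELIM)
    if not sep:
        raise ValueError("no complete message in buffer")
    return unescape(body)
```

Two chained replace calls would not be enough on the read side, because an escaped backslash followed by a colon is ambiguous to them; that is why unescape scans byte by byte.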

Second, you're right that "multiple reads could occur for this single write"—but it's also true that multiple writes could occur for this single read. You could read half of this message, plus half of the next. This means you can't just use endswith; you have to use something like partition or split, and write code that can handle multiple messages, and also write code that can store partial messages until the next time through the read loop.
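As a rough sketch of what that buffering looks like, assuming Python 3, bytes data, and payloads that never contain the raw delimiter (or were escaped by the sender): the MessageBuffer class and its feed method are hypothetical names, not part of any library.

```python
DELIM = b":::END"

class MessageBuffer:
    """Collects raw chunks and hands back only complete messages."""

    def __init__(self):
        self._buf = b""

    def feed(self, chunk: bytes) -> list:
        # Append whatever the socket returned, then peel off every
        # complete message; any trailing partial message stays buffered.
        self._buf += chunk
        messages = []
        while True:
            msg, sep, rest = self._buf.partition(DELIM)
            if not sep:
                break  # delimiter not seen yet: wait for more data
            messages.append(msg)
            self._buf = rest
        return messages
```

In a read loop you would call something like buf.feed(conn.recv(4096)) and process each returned message; a single chunk may yield zero, one, or several messages.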

Meanwhile, to your specific questions:

Is there a proper way to encapsulate the data with as minimum of bandwidth addition as possible?

Sure, there are at least three proper ways: Delimiters, prefixes, or self-delimiting formats.

You've already found the first. And the problem with it: unless there's some string that can never possibly appear in your data (e.g., '\0' in human-readable UTF-8 text), there is no delimiter you can pick that won't require escaping.

A self-delimiting format like JSON is the easiest solution. When the last opened brace/bracket closes, the message is over, and it's time for the next one.

Alternatively, you can prefix each message with a header that includes the length. This is what many lower-level protocols (like IP and UDP) do. One of the simplest formats for this is netstring, where the header is just the length in bytes as an integer represented as a normal base-10 string, followed by a colon. The netstring protocol also ends each message with a comma, which adds some error checking.
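A minimal sketch of netstring framing, assuming Python 3 and bytes payloads. The function names encode_netstring and decode_netstrings are invented for this example; the decoder returns the complete messages plus whatever incomplete tail should be kept for the next read.

```python
def encode_netstring(payload: bytes) -> bytes:
    # <decimal length> ":" <payload> ","
    return str(len(payload)).encode("ascii") + b":" + payload + b","

def decode_netstrings(buf: bytes):
    """Return (complete_messages, unconsumed_tail)."""
    messages = []
    while True:
        head, sep, rest = buf.partition(b":")
        if not sep:
            break  # length header not complete yet
        if not head.isdigit():
            raise ValueError("bad netstring length: %r" % head)
        length = int(head)
        if len(rest) < length + 1:
            break  # payload (plus trailing comma) not fully received
        if rest[length:length + 1] != b",":
            raise ValueError("missing trailing comma")
        messages.append(rest[:length])
        buf = rest[length + 1:]
    return messages, buf
```

Because the length tells the reader exactly how many bytes to take, the payload can contain colons, commas, or any other byte without escaping. The overhead is the digits of the length plus two bytes.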

I had thought about pickle or json, but worried that will add a significant amount of bandwidth since I believe they will convert the binary data to ASCII

pickle has both binary and text formats. As the documentation explains, if you use protocol 2, 3, or HIGHEST_PROTOCOL, you will get a reasonably efficient binary format.
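If you want to see the difference for your own data, a quick comparison along these lines should do, assuming Python 3. The record dictionary is just made-up sample data; the exact sizes depend on the Python version and on what you pickle.

```python
import pickle

# Hypothetical sample message mixing numbers and raw bytes.
record = {"id": 42, "samples": list(range(1000)), "blob": bytes(range(256))}

text_form = pickle.dumps(record, protocol=0)  # ASCII-oriented protocol
binary_form = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# Both forms decode back to the same object.
assert pickle.loads(text_form) == record
assert pickle.loads(binary_form) == record

print("protocol 0:", len(text_form), "bytes")
print("highest protocol:", len(binary_form), "bytes")
```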

JSON, on the other hand, only handles strings, numbers, arrays, and dictionaries. You have to manually render any binary data into a string (or an array of strings or numbers, or whatever) before you can JSON-encode it, and then reverse things on the other side. Two common ways to do this are base-64 and hex, which add about 33% and 100% respectively to the size of your data, but there are more efficient ways to do it if you really need to.
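To make the overhead concrete, here is a small sketch, assuming Python 3. The 3000-byte random payload and the "blob" field name are arbitrary choices for the example.

```python
import base64
import binascii
import json
import os

raw = os.urandom(3000)           # stand-in for some binary payload

b64 = base64.b64encode(raw)      # 4 output bytes per 3 input bytes
hexed = binascii.hexlify(raw)    # 2 output bytes per input byte

print(len(raw), len(b64), len(hexed))

# Round trip through JSON: the binary data travels as an ASCII string.
wire = json.dumps({"blob": b64.decode("ascii")})
restored = base64.b64decode(json.loads(wire)["blob"])
assert restored == raw
```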

And of course the JSON protocol itself uses a few more characters than strictly necessary, what with all those quotes and commas and so on, and whatever names you give to any fields are sent as uncompressed UTF-8. You can always replace JSON with BSON, Protocol Buffers, XDR, or other serialization formats that are less "wasteful" if it's really an issue.

Meanwhile, pickle isn't self-delimiting. You have to first split the messages apart, before you can unpickle them. JSON is self-delimiting, but you can't just use json.loads unless you first split the messages apart; you'll have to write something more complicated. The simplest thing that works is to repeatedly call raw_decode on the buffer until you get an object.
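A rough sketch of that raw_decode loop, assuming Python 3, that every message is a JSON object or array (bare top-level numbers are not self-delimiting), and that the received bytes have already been decoded to str. The helper name extract_json_messages is made up for this example.

```python
import json

_decoder = json.JSONDecoder()

def extract_json_messages(buf: str):
    """Return (decoded_messages, unconsumed_tail)."""
    messages = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(buf) and buf[pos].isspace():
            pos += 1
        if pos >= len(buf):
            break
        try:
            obj, end = _decoder.raw_decode(buf, pos)
        except json.JSONDecodeError:
            # Incomplete message (or malformed input; this sketch cannot
            # tell the two apart). Keep the tail and wait for more data.
            break
        messages.append(obj)
        pos = end
    return messages, buf[pos:]
```

The tail that comes back should be prepended to the next chunk you read. A real implementation would also want a size limit on the buffer, so that garbage input doesn't accumulate forever.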
