Gere

Reputation: 12687

Writing a Python data analysis server for a Java interface

I want to write data analysis plugins in Python for a Java interface. This interface may run on different computers. The interface sends commands, and the Python program can return large amounts of data. The interface is distributed via Java Web Start. Both access the main data from a MySQL server.

What are the different ways to implement the communication, and what are their respective advantages? Of course, I've done some research on the internet, but while there are many suggestions, I still don't know how they differ or how to decide on one. (I have no prior experience with any of them.)

I've found a suggestion to use sockets, which seems fine. Is it simple to write a server that dedicates a Python analysis process to each connection (so that temporary data can be kept for that particular client after a request has been handled)?

I was thinking of learning how to use sockets and passing YAML strings. Maybe my main question is: how do systems like RabbitMQ, ZeroMQ, CORBA, SOAP, and XML-RPC relate to that approach, and what advantages do they have?
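For concreteness, this is roughly what I have in mind: a minimal sketch using the standard library's socketserver module and PyYAML, with the actual analysis stubbed out (one YAML document per line is just a simplifying assumption):

    import socketserver
    import yaml  # PyYAML, assumed to be installed

    class AnalysisHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # ForkingTCPServer gives each connection its own process,
            # so state stored on self survives between requests from
            # the same client
            for raw in self.rfile:
                command = yaml.safe_load(raw)
                result = {"echo": command}  # placeholder for real analysis
                self.wfile.write(yaml.safe_dump(result).encode("utf-8"))

    if __name__ == "__main__":
        # ForkingTCPServer is Unix-only; ThreadingTCPServer would be the
        # portable alternative
        with socketserver.ForkingTCPServer(("", 9999), AnalysisHandler) as server:
            server.serve_forever()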

There were also suggestions to use pipes or shared memory, but those wouldn't fit my requirements, would they?

Do any of these methods have particular advantages for debugging, or other peculiarities?

I hope someone can help me understand these technologies and decide on a solution, as it is hard to judge from the technical descriptions alone.

(I do not consider solutions like Jython, JEPP, ...)

Upvotes: 0

Views: 231

Answers (1)

Mark Grey

Reputation: 10257

Offering an opinion on the requirements you described: it sounds like you are dealing with potentially large data and queries that may take a long time to fetch and serialize, in which case you definitely want something that can handle concurrent connections without stacking up threads. To that end, in the Python world, I can't recommend any networking library other than Twisted.

http://twistedmatrix.com/documents/current/core/examples/

Whether you decide to use vanilla HTTP or your own protocol, Twisted is pretty much the one-stop shop for concurrent networking. Sure, the name gets thrown around a lot, and the documentation is Atlantean, but if you take the time to learn it there is very little in the networking domain you cannot accomplish. You can extend the base protocols and factories to make one server that handles your data in a reactor-based event loop and responds to deferred requests when ready.
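For instance, a bare-bones Twisted TCP server along those lines might look like this (the class names and the stubbed-out analyze method are mine, purely for illustration):

    from twisted.internet import protocol, reactor
    from twisted.protocols.basic import LineReceiver

    class AnalysisProtocol(LineReceiver):
        def lineReceived(self, line):
            # one protocol instance per connected client, so per-client
            # temporary state can simply live on self
            result = self.analyze(line)
            self.sendLine(result)

        def analyze(self, command):
            return b"result for: " + command  # placeholder

    class AnalysisFactory(protocol.Factory):
        def buildProtocol(self, addr):
            return AnalysisProtocol()

    reactor.listenTCP(8007, AnalysisFactory())
    reactor.run()

Since each connected client gets its own protocol instance for the lifetime of the connection, that maps onto your requirement of keeping temporary data around between requests from the same client.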

The serialization format really depends on the nature of the data. Will there be any binary data in the responses? Complex types? If so, that rules out JSON, even though it is becoming the most common serialization format. YAML sometimes seems to enjoy a position of privilege in the Python community; I haven't used it extensively, as most of the serialization work I've done was data to be rendered in a frontend with JavaScript.
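A quick illustration of the binary problem, using only the standard library (the payload here is made up):

    import base64
    import json

    payload = {"rows": 2, "blob": b"\x00\xff\x10"}

    try:
        json.dumps(payload)  # raw bytes are not JSON-serializable
    except TypeError as exc:
        print(exc)

    # a common workaround: base64-encode binary fields before serializing
    safe = dict(payload, blob=base64.b64encode(payload["blob"]).decode("ascii"))
    print(json.dumps(safe))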

Message queues are really the most important tool in the toolbox when you need to defer background tasks without hanging the response. They are commonly employed in web apps where the HTTP request should not block until some complex processing completes, so the UI can render early and count on an implicit "promise" that the processing will happen. They have two important traits: they rely on eventual consistency, in that the task can finish long after the protocol response is sent, and they offer fail-safe and try-again directives should a task fail. They are where you turn in the "do this really hard task as soon as you can, and I trust you to get it done" problem domain.
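To make that pattern concrete without committing to any particular broker, here is a naive in-process sketch using only the standard library; a real deployment would use RabbitMQ or similar rather than queue.Queue:

    import queue
    import threading

    tasks = queue.Queue()

    def worker():
        while True:
            job = tasks.get()
            try:
                job()           # may finish long after the caller moved on
            except Exception:
                tasks.put(job)  # naive "try again" on failure
            finally:
                tasks.task_done()

    threading.Thread(target=worker, daemon=True).start()

    # the producer returns immediately; the work happens eventually
    tasks.put(lambda: print("heavy analysis finished"))
    tasks.join()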

If we are not talking about potentially HUGE response bodies, and the serialized output contains relatively simple data types, there is nothing wrong with rolling a simple deferred HTTP server in Twisted.
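A sketch of such a server, in case it helps; this uses twisted.web and pushes the blocking work to a thread pool, with run_analysis as a stand-in for your actual analysis:

    from twisted.internet import reactor, threads
    from twisted.web.resource import Resource
    from twisted.web.server import NOT_DONE_YET, Site

    def run_analysis():
        return b"analysis result"  # stand-in for the real, blocking work

    class AnalysisResource(Resource):
        isLeaf = True

        def render_GET(self, request):
            # defer the blocking call to a thread and write the
            # response when the Deferred fires
            d = threads.deferToThread(run_analysis)

            def finish(result):
                request.write(result)
                request.finish()

            d.addCallback(finish)
            return NOT_DONE_YET

    reactor.listenTCP(8080, Site(AnalysisResource()))
    reactor.run()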

Upvotes: 1
