Reputation: 6802
We're in a bind here at work. We do integrations with a lot of 3rd party vendors and one of them, Vendor X, is proving to be very difficult.
While the other vendors offer easy-to-use APIs, typically RESTful JSON-based services or XML-based Web Services, Vendor X does everything over FTP. And where other vendors split their API into separate calls for specific tasks, Vendor X uses one TDF to do multiple tasks.
The latter problem threatens to kill off our ability to integrate because we only care about updating one dataset, but Vendor X will not let us supply this information without supplying a lot of other information that we don't know (or care about).
The thing is, Vendor X has a web portal where this information could be easily updated. Technically I could write a client that logs into this portal, parses the HTML to find the table to update, and submits a form.
Our application is written in Java, but most of the Java libraries that do this sort of task (Selenium, HtmlUnit) seem to be oriented toward testing rather than executing enterprise-level tasks under heavy load. A library like jsoup does not work well for submitting forms as far as I can tell. Even though we're only updating one integer field per record and clicking submit, I'm still a bit wary.
Is there a programmable headless browser (Java or otherwise) that is suitable for this task? What does the community think of this approach in general? When there is no clean solution and we have to resort to this kind of hackery, it's tempting for the development team to simply say, "we can't support the vendor in question", but because the vendor is pretty lucrative, management doesn't see that as an option.
Upvotes: 1
Views: 395
Reputation: 115
If it can be done programmatically (e.g. by generating a GET or POST), that's the route I'd go before anything else, using wget or curl. I've never tried to scale them to a task like this, but they've been fairly robust and low-footprint when I've used them for large sequential tasks. I'm sure there are also Java libraries that let you do the same and avoid the exec() (probably in java.net somewhere).
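For the pure-Java route, here's a rough sketch of what I mean, assuming Java 11+ and its built-in java.net.http client. The class name, paths, and field names are all made up for illustration; you'd have to read the portal's real login and update forms (including any hidden fields they carry) to find the actual ones.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical client: posts HTML forms directly instead of driving a browser.
class PortalFormClient {
    // The CookieManager keeps the session cookie set at login for later requests.
    private final HttpClient http = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
    private final String baseUrl; // assumption: the portal's root URL

    PortalFormClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Encodes fields the way a browser does for a standard form POST.
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds (but does not send) a form POST request.
    static HttpRequest formPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
    }

    // Sends the form to a path under baseUrl and returns the HTTP status code.
    int submit(String path, Map<String, String> fields)
            throws IOException, InterruptedException {
        HttpResponse<String> response = http.send(
                formPost(baseUrl + path, fields),
                HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

You'd call submit once with the login form's fields (say "/login" with a username and password, both hypothetical here), then once per record with the update form's fields. Since there's no browser involved, each update costs about as much as any other HTTP call, which is why I'd try this before a headless browser.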
If it's a JS-heavy site, however, you may have to resort to PhantomJS. That's a JS-driven headless WebKit, so expect it to eat a lot of CPU time and memory. There are at least some efforts to do clustering for it, though. (There's a similar project for Gecko, but it's not headless yet.)
Upvotes: 1