Mike Preston

Reputation: 163

UNIX file sort issue from Java

We have a Java program which requires a file to be sorted in mid-process. The file in question can hold any printable character available from the keyboard. The sort works with a standard single-character delimiter, but when that character also appears in the data, sort splits the fields incorrectly. We would like to use either a tab delimiter or a multi-character delimiter so that the file sorts correctly regardless of the contents of the data. We are building the command string dynamically and passing it to the shell to execute, as shown below.

execStr = new StringBuffer("/usr/bin/sort -n +1n -2 +0n -1 -o " + outputFile.toString() + " -t " + DELIMITER + " " + outputFile.toString());
Process runProc = Runtime.getRuntime().exec(execStr.toString());

If we wrap the delimiter in the $ and tick marks, sort fails to find the desired columns and sorts on the first column instead, which is a problem when we try to specify the tab character as $'\t'. We have also tried characters outside the printable range, such as $'Ç' (hex C7), but the Java string renders that character as a question mark, giving $'?', which of course does not work for us. It seems like the way Java handles strings and the way sort reads them is giving us fits. Has anyone else encountered this problem and, if so, how did you solve it? Ideally a multi-character delimiter would be best for us, but we'll take the tab character if we can get it to work.

Thanks in advance, Mike

Upvotes: 2

Views: 937

Answers (1)

Norman Gray

Reputation: 12514

You're making this hard for yourself by using a convenience method!

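First, what is $'\t'? Passed this way, that's just five literal characters (dollar, quote, backslash, t, quote), not a tab. It is shell quoting syntax, and it only turns into a tab character when a shell such as bash interprets it; Runtime.exec() doesn't involve a shell at all.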

The key thing is to note that in exec(command), the command string is split using a StringTokenizer which will split the command string on whitespace. Whitespace includes your tab character, which therefore disappears -- that's why including a literal tab character doesn't work.
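To see the effect in isolation, here is a minimal sketch that applies the same default StringTokenizer splitting to a command string containing a literal tab. The class name ExecTokenizeDemo, the tokenize helper, and the file name data.txt are made up for illustration; the sketch only mimics the splitting step and does not run anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ExecTokenizeDemo {
    // Hypothetical helper: splits a command string the way a default
    // StringTokenizer does (delimiters: space, tab, newline, CR, form feed).
    static List<String> tokenize(String command) {
        List<String> args = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(command);
        while (st.hasMoreTokens()) {
            args.add(st.nextToken());
        }
        return args;
    }

    public static void main(String[] args) {
        // The delimiter after -t is a literal tab character.
        String command = "/usr/bin/sort -t \t data.txt";
        // Print each resulting argument in brackets so that an empty or
        // missing argument would be visible.
        for (String arg : tokenize(command)) {
            System.out.println("[" + arg + "]");
        }
    }
}
```

The tab is consumed as a separator along with the spaces around it, so only three arguments come out and -t ends up followed directly by the file name.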

Also (though this isn't really anything to do with your problem), your StringBuffer is redundant, since it's being initialised with a single string which is concatenated the usual way using +.

You'd be best to create the command using ProcessBuilder (as jackrabbit's comment suggested). That way, you control exactly what arguments are what, and if you include a literal tab character as one of the arguments, that's what'll be included in the argument passed to the program.

ProcessBuilder pb = new ProcessBuilder("/usr/bin/sort", "-t", "\t", ...);
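Filled out a little, that might look like the sketch below. The class name TabSortSketch and the buildSortCommand helper are hypothetical, and the -k2,2n and -k1,1n keys are my assumption about what the obsolete "+1n -2 +0n -1" keys in the question were meant to do (numeric sort on the second field, then on the first); adjust them to whatever you actually need.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TabSortSketch {
    // Hypothetical helper: builds the argument list one element at a time,
    // so the delimiter reaches sort exactly as given, with no splitting.
    static List<String> buildSortCommand(Path file, String delimiter) {
        return Arrays.asList(
            "/usr/bin/sort",
            "-t", delimiter,         // the field separator is its own argument
            "-k2,2n",                // assumed: numeric sort on field 2 ...
            "-k1,1n",                // ... then numeric sort on field 1
            "-o", file.toString(),   // write the result back to the same file
            file.toString());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: java TabSortSketch <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        // "\t" is a Java string holding one real tab character.
        ProcessBuilder pb = new ProcessBuilder(buildSortCommand(file, "\t"));
        pb.inheritIO(); // let any error messages from sort show up on the console
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("sort exited with status " + exitCode);
        }
    }
}
```

As far as I know, sort's -t option only takes a single character, so this gets you the tab delimiter but not a multi-character one.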

It's very easy, in doing something like this, to forget that the shell does quite a lot of work on a command typed in a terminal, and that you don't have a shell doing that sort of quoting and escaping work in a context like this. The shell assembles an argument list consisting of an array of strings, and that's what's passed to exec(3). For sanity's sake, you want to skip intermediaries as much as possible, and assemble this argument list yourself.

Upvotes: 1
