Reputation: 151

Calculating Overall Median from Multiple Files Using GNU parallel

I have the following command:

find 01/ -type f -name '*.csv.gz' | parallel "pigz -dc {} | datamash -t, median 3"

This command decompresses each .csv.gz file found under the '01/' directory and calculates the median of the third column of that file. However, I want the overall median across all files, not a separate median for each file.

P.S. I've tried running:

find 01/ -type f -name '*.csv.gz' | parallel "pigz -dc {} | datamash -t, median 3" | datamash median 1

But this yields the "median of medians", which is not the result I'm seeking.

Upvotes: 0

Answers (1)

Ole Tange

Reputation: 33740

This is really a comment, but too long.

To get the exact median of n numbers you need O(n) space. This is because if The Devil designs the input, he can force any place in the sequence to be the median, and you will have no way of ruling out any place until you have read at least n/2 numbers.
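For comparison, the exact answer can be had by concatenating all files into one stream and taking a single median over it, at the O(n) space cost described above. A minimal sketch, assuming one number per input line and no header rows; exact_median is a hypothetical helper name, not a standard tool:

```shell
# Exact median of one number per line on stdin.
# sort and awk both see the whole input, so space is O(n).
exact_median() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                      # no input, no median
      if (NR % 2) print v[(NR + 1) / 2]        # odd count: middle value
      else printf "%.15g\n", (v[NR / 2] + v[NR / 2 + 1]) / 2   # even: mean of the two middle values
    }'
}

# Possible use with the files from the question (third CSV column):
# find 01/ -type f -name '*.csv.gz' -exec pigz -dc {} + | cut -d, -f3 | exact_median
```

Since datamash also reads standard input, piping the concatenated stream into a single "datamash -t, median 3" should give the same exact result, provided all the values fit in memory.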

If, however, The Devil has not designed the input, and the input is more or less random or normally distributed, then we can get the correct value, or one very close to it, most of the time using Remedian (Rousseeuw, Peter J., and Gilbert W. Bassett Jr. "The remedian: A robust averaging method for large data sets." Journal of the American Statistical Association 85.409 (1990): 97-104). Remedian keeps a fixed number of small buffers, so for a given capacity it uses O(1) space.

GNU Parallel uses Remedian internally in set_remedian; and yes: It really is only ~10 lines of code.

https://git.savannah.gnu.org/cgit/parallel.git/tree/src/parallel#n14377

So I would run something like:

find 01/ -type f -name '*.csv.gz' -exec pigz -dc {} + | awk -F, '{print $3}' | remedian

where remedian is your implementation of Remedian.
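As a starting point, here is a rough sketch of such a remedian filter as a shell function wrapping awk. It is my own illustration of the idea, not the code from GNU Parallel. Assumptions: one number per line on stdin, an odd buffer size b (default 11, chosen arbitrarily), and leftover values combined with a weighted median in which a value at level k counts for b^k inputs. Unlike a version with a fixed number of buffers, this sketch adds levels as needed, so it holds roughly b * log_b(n) numbers rather than a strict constant.

```shell
# Sketch of Remedian: approximate median of one number per line on stdin.
# Optional argument: buffer size b (should be odd; default 11).
remedian() {
  awk -v b="${1:-11}" '
    # Median of the n values stored at level lvl (insertion sort into tmp).
    function med(lvl, n,    i, j, x, tmp) {
      for (i = 1; i <= n; i++) {
        x = buf[lvl, i] + 0
        for (j = i - 1; j >= 1 && tmp[j] > x; j--) tmp[j + 1] = tmp[j]
        tmp[j + 1] = x
      }
      return tmp[int((n + 1) / 2)]
    }
    # Store x at level lvl; a full buffer is replaced by its median one level up.
    function push(lvl, x) {
      buf[lvl, ++cnt[lvl]] = x
      if (lvl > top) top = lvl
      if (cnt[lvl] == b) {
        x = med(lvl, b)
        cnt[lvl] = 0
        push(lvl + 1, x)
      }
    }
    { push(0, $1) }
    END {
      if (NR == 0) exit 1
      # Leftovers: a value at level lv stands for b^lv inputs.
      for (lv = 0; lv <= top; lv++)
        for (i = 1; i <= cnt[lv]; i++) {
          val[++m] = buf[lv, i] + 0; wt[m] = b ^ lv; total += wt[m]
        }
      # Sort the leftovers by value, then take their weighted median.
      for (i = 2; i <= m; i++) {
        x = val[i]; w = wt[i]
        for (j = i - 1; j >= 1 && val[j] > x; j--) {
          val[j + 1] = val[j]; wt[j + 1] = wt[j]
        }
        val[j + 1] = x; wt[j + 1] = w
      }
      for (i = 1; i <= m; i++) {
        acc += wt[i]
        if (2 * acc >= total) { printf "%.15g\n", val[i]; exit }
      }
    }'
}

# Possible use with the files from the question (third CSV column):
# find 01/ -type f -name '*.csv.gz' -exec pigz -dc {} + | awk -F, '{print $3}' | remedian
```

A larger b generally tracks the true median more closely at the cost of more memory. On sorted or otherwise adversarial input the result can be far from the true median, which is the "Devil" case described above.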

Upvotes: 3
