Remi.b

Reputation: 18219

Fastest way to merge millions of files

There are 81 million files (!) stored in one directory on a remote machine. All files end in ".paintedHaploDiversity". I would like to merge those files into one called allOutputs_3.5 in the parent directory. More specifically, each file contains two or three lines. The first line is a header that I can ignore. Among the remaining one or two lines, one has the value 2 in its fourth column. For each file, I want to copy the whole line that has the 2 in the fourth column and prepend to it the filename (excluding the extension ".paintedHaploDiversity"). I refer to this filename as "simID".

For information, the remote machine runs Mac OS X 10.11.6 (15G22010). It is a simple desktop, so there is no network involved (apart from the ssh connection I use to reach it).

I first tried

for f in *;
do
   simID=${f%.paintedHaploDiversity}
   awk -v simID=${simID} 'NR>1{if ($4==2) {printf simID"\t"; print}}' $f >> ../allOutputs_3.5
done

but it was very slow. I estimated the time required at months or even years! Then I tried

awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' * >> ../allOutputs

but it does not seem any faster. Just as a speed test, I also considered

find . -exec cat '{}' ';' > out

but it is again very slow. Thinking that the issue might come from the glob expansion of *, I tried to loop through each file by reconstructing its name with two C-style loops.

for ((bigID=1; bigID <= 9 ;++bigID)); do
   for ((rep=1; rep <= 9000000 ;++rep)); do
      awk -v simID=3.5_${bigID}_${rep} 'NR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_${bigID}_${rep}.paintedHaploDiversity >> ../allOutputs_3.5
   done
done

The process is now quite a bit faster, but it would still take months to run! Finally, I figured I might as well remove the lines whose fourth column is not equal to 2 later on (probably with a sed command) and do

for ((bigID=1; bigID <= 6 ;++bigID)); do
   for ((r=1; r <= 9000000 ;++r)); do
      printf "3.5_${bigID}_${r}\t"  >> ../allOutputs_3.5
      tail -n +2 3.5_${bigID}_${r}.paintedHaploDiversity >> ../allOutputs_3.5
   done
done

This process is now expected to take about two weeks, which starts to be reasonable. I am still wondering what is making it so slow and whether it can be improved.

I suppose the bottleneck is likely disk I/O. Or is it the filesystem that takes a lot of CPU time? Is the process so slow because there are so many files in the same directory, so that every iteration of the loop has to search through a tree of 81 million directory entries? How can it be improved? Should I try writing the process in C++?

If it helps, here is the output of top -o MEM while the last command (the one using printf and tail) was running

Processes: 254 total, 3 running, 12 stuck, 239 sleeping, 1721 threads                            03:12:40
Load Avg: 2.04, 1.79, 1.60  CPU usage: 0.84% user, 4.33% sys, 94.81% idle
SharedLibs: 85M resident, 11M data, 10M linkedit.
MemRegions: 42324 total, 4006M resident, 63M private, 230M shared.
PhysMem: 14G used (2286M wired), 10G unused.
VM: 753G vsize, 535M framework vsize, 1206153(0) swapins, 2115303(0) swapouts.
Networks: packets: 413664671/284G in, 126210468/104G out.
Disks: 1539349069/12T read, 1401722156/7876G written.

PID    COMMAND      %CPU TIME     #TH    #WQ  #PORTS MEM    PURG  CMPRS  PGRP  PPID  STATE
0      kernel_task  42.1 1716 hrs 167/25 0    2-     1968M  0B    0B     0     0     running
366    SystemUIServ 0.4  24:42:03 5      2    345    1055M  0B    10M    366   1     sleeping
472    softwareupda 0.0  12:46:11 5      0    3760   340M   0B    18M    472   1     sleeping
54242  Sublime Text 0.0  03:55:44 12     0    237    233M   0B    68K    54242 1     sleeping
63     powerd       0.0  44:07:21 2      0    95     204M   0B    8932K  63    1     sleeping
34951  Finder       0.1  04:11:06 9      2    1665   166M   0B    68M    34951 1     sleeping
197    WindowServer 0.0  40:02:58 3      0    453    142M   0B    63M    197   1     sleeping
13248  Terminal     0.0  84:19.45 5      0    388    114M   0B    113M   13248 1     sleeping
29465  X11.bin      0.0  89:38.70 9      0    229    104M   0B    16M    29464 29464 sleeping
12372  system_insta 0.0  00:31.61 2      0    75     78M    0B    9996K  12372 1     sleeping
1588   sysmond      0.0  02:34:04 2      1    23     62M    0B    4536K  1588  1     sleeping
54245  plugin_host  0.0  00:03.88 5      0    56     51M    0B    0B     54242 54242 sleeping
554    spindump     0.0  00:36.51 2      1    164    44M    0B    33M    554   1     sleeping
20024  com.apple.GS 0.0  00:01.43 3      2    24     43M    0B    2200K  20024 1     sleeping
475    suhelperd    0.0  00:19.84 2      0    55     42M    0B    28M    475   1     sleeping
418    installd     0.0  01:21.89 2      0    69     40M    0B    12M    418   1     sleeping
57     fseventsd    0.1  13:03:20 10     0    241    39M    0B    2904K  57    1     sleeping
364    Dock         0.0  08:48.83 3      0    283    38M    0B    27M    364   1     sleeping
201    sandboxd     0.0  18:55.44 2      1    38     38M    0B    10M    201   1     sleeping
103    loginwindow  0.0  04:26.65 2      0    377    35M    0B    3400K  103   1     sleeping
897    systemstatsd 0.0  65:30.17 2      1    43     34M    0B    4928K  897   1     sleeping
367    fontd        0.0  11:35.30 2      0    77     32M    0B    5920K  367   1     sleeping
396    ScopedBookma 0.0  01:00.46 3      2    46     32M    0B    28M    396   1     sleeping
22752  cfbackd      0.4  32:18.73 9      1    84     30M    0B    0B     22752 1     sleeping
39760  Preview      0.0  00:03.75 3      0    209    29M    0B    0B     39760 1     sleeping
53     syslogd      0.0  05:33:59 4      3    186-   29M-   0B    1668K  53    1     sleeping
533    SmartDaemon  0.0  27:07.67 10     7    175    28M    128K  5192K  533   1     stuck   
388    iconservices 0.0  00:08.85 2      1    66     27M    0B    157M   388   1     sleeping
7268   diskmanageme 0.0  00:40.14 888    0    8899   27M    0B    7352K  7268  1     sleeping
513    Notification 0.0  00:46.42 3      0    245    26M    0B    9852K  513   1     sleeping
83     opendirector 0.0  19:22:12 6      5    8827   26M    0B    2444K  83    1     sleeping
557    AppleSpell   0.0  03:12.61 2      0    57     26M    0B    10M    557   1     sleeping
422    com.apple.ge 0.0  01:50.41 5      0    83     25M    0B    1680K  422   1     sleeping
397    storeaccount 0.0  00:48.41 4      0    1333   21M    0B    2248K  397   1     sleeping
87     launchservic 0.0  64:26.85 3      2    306    20M    0B    5804K  87    1     sleeping
1      launchd      0.0  26:26:23 5      4    1802   20M    0B    6532K  1     0     stuck   
222    taskgated    0.0  17:59:00 3      1    43     19M    0B    4528K  222   1     sleeping
54     UserEventAge 0.0  18:19.74 3      0    32605- 18M-   0B    2968K  54    1     sleeping
4527   com.apple.sp 0.0  00:13.01 2      0    48     17M    0B    7792K  4527  1     sleeping
79     coreduetd    0.0  05:40.06 2      0    95     17M    0B    4604K  79    1     sleepin

and here is the output of iostat

      disk0           disk1           disk2       cpu     load average
KB/t tps  MB/s     KB/t tps  MB/s     KB/t tps  MB/s  us sy id   1m   5m   15m
7.19 152  1.07     8.10   0  0.00     8.22   0  0.00  15 50 35  1.68 1.74 1.59

Example:

Consider the following files

file_0:

first second third fourth fifth
bbb a a 2 r

file_1:

first second third fourth fifth
f o o 2 o

file_2:

first second third fourth fifth
f r e 1 e
x xxx x 2 x

file_3:

first second third fourth fifth
a a a 2 a

The expected output is

file_0 bbb a a 2 r
file_1 f o o 2 o
file_2 x xxx x 2 x
file_3 a a a 2 a

Upvotes: 5

Views: 1864

Answers (5)

thanasisp

Reputation: 5975

Any solution with a bash loop that calls one or more processes millions of times will be very slow. Also, the attempt awk '{...}' * > output, for me on Linux, resulted in: bash: /usr/bin/awk: Argument list too long.


With find and xargs

find is what you have to use, but not with -exec, because that way you again call millions of processes, one per file argument. Use it with xargs instead, which lets you pass tons of arguments to one process. You can also do the job in batches with xargs -n. In general it is possible to hit some limitation of your OS, bash arguments, etc., but I have not tested with such a huge number.


I executed the solution below on a very old box, slower than the desktop in question, and a sample of 800K files (1% of the total in question) took 3 minutes.

find . -type f -printf "%f\n" |\
xargs awk '$4==2{ print(substr(FILENAME, 1, length(FILENAME)-22), $0) }' >> output.txt

First, you have to avoid swap usage during execution or it will slow down dramatically, and second, you will probably hit some of the limits mentioned above. So it may need to be done in batches, e.g. you run find once and save the results to a file, split that file into batches (e.g. 1M filenames each) and xargs each chunk to awk.
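
For example, a rough sketch of that batched variant (filenames.txt, the chunk prefix and output.txt are placeholder names; -printf is GNU find, as above):

# list all filenames once (GNU find: %f prints the bare filename)
find . -type f -name '*.paintedHaploDiversity' -printf "%f\n" > ../filenames.txt

# split the list into chunks of 1M names and feed each chunk to a single awk
split -l 1000000 -d ../filenames.txt ../chunk
for c in ../chunk*; do
    xargs awk '$4==2{ print(substr(FILENAME, 1, length(FILENAME)-22), $0) }' < "$c" >> ../output.txt
done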


Without find, creating filenames with loop: Use xargs again

I see that you can create the filenames in a bash loop, as they follow a standard pattern, which could be faster than find, but I believe this is not the bottleneck anyway. Again, you should not execute one command per filename; instead, feed the whole list to awk through xargs.

For example, create the filenames with your loop and save them to a file.

for (( i=1;i<=9;i++ )); do
   for (( j=1;j<=9000000;j++ )); do
      printf "file_%s_%s\n" "$i" "$j" >> filenames.txt
   done
done

and feed them once to awk:

cat filenames.txt | xargs awk '{...}'

or in batches, e.g. of 1M

split -l 1000000 -d filenames.txt chunk
for f in chunk*; do cat "$f" | xargs awk '{...}' ; done

Upvotes: 0

markp-fuso

Reputation: 34663

If the printf/tail attempt is considered the fastest at this point (2 weeks? based solely on the OP's comments), I'd want to eliminate the 81 million printf/tail command pairs in favour of a smaller number of awk/substr(FILENAME) calls, each working on a wildcard set that breaks processing into, say, ~10K files at a time, eg:

for bigID in {1..6}
do
    # poll first 99 files (r=1..99)

    awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_${bigID}_{1..99}.paintedHaploDiversity >> ../allOutputs

    # break rest of files into ~10K chunks based on first 3 digits of suffix

    for r in {100..999}      # prefixes 100..999 cover r=100..9000000 in ~10K chunks
    do
        awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_${bigID}_${r}*.paintedHaploDiversity >> ../allOutputs
    done
done

NOTE: I'm only picking 10K on the assumption that there's some sort of performance hit when awk grabs a bigger set of files; some testing of this size may find a sweet spot for the number of files awk can (quickly) handle
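
For example, a quick way to probe for that sweet spot on a small slice of the data (the globs below are just illustrative sub-ranges of r):

# ~100 files (r=1000..1099)
time awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_1_10??.paintedHaploDiversity > /dev/null

# ~1K files (r=1000..1999)
time awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_1_1???.paintedHaploDiversity > /dev/null

and so on for wider globs, comparing the per-file throughput of each run.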


Also, iostat is showing 3x disks. If these are 3x physically separate disks and they're attached as separate disks (ie, not part of a RAID config), then make sure the target file (allOutputs_3.5) resides on a different disk from the source files. This should cut down on the read->write->read->write thrashing (more so on HDDs, less so on SSDs).

NOTE: This (obviously) assumes there is room on the other disk(s) to hold the target file.

I'd probably want to test this idea (read from disk #1, write to disk #2) with a small subset of files (eg, 110K), using each of the previously mentioned coding attempts, to see if there's a (relatively) large diff in timings (thus pointing at the read/write thrashing as being one bottleneck).
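
One such test might look like this (the /Volumes/disk2 mount point is hypothetical, and the glob 3.5_1_11* picks out a ~110K-file subset):

# time the printf/tail attempt over ~110K files, writing to a second disk
time for f in 3.5_1_11*.paintedHaploDiversity; do
    printf '%s\t' "${f%.paintedHaploDiversity}" >> /Volumes/disk2/allOutputs_test
    tail -n +2 "$f" >> /Volumes/disk2/allOutputs_test
done

Then repeat with the output redirected to the source disk and compare the two timings.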

Upvotes: 0

Socowi

Reputation: 27235

You can probably cope with just one call each to the programs grep and sed. This should be pretty fast; maybe even faster than a self-written C program.

cd dir_with_all_the_files
grep -rE '^([^ ]+ +){3}2 ' . | 
sed -En 's/^\.\/(.*)\.paintedHaploDiversity:/\1 /p' > ../allOutputs_3.5

Assumptions made:

  • The header of the column to be searched (the fourth) isn't itself 2.
  • The directory contains no subdirectories.
    (Otherwise the command would still produce correct results, but it would run needlessly long.)
  • The filenames contain no : or linebreaks.
  • Your grep implementation supports the non-POSIX -r option (usually the case).

Further improvements if your grep implementation supports it:

  • Add -m1 to speed up the search.
  • Try grep -P (usually not supported on Mac OS) or pcregrep. PCRE is sometimes faster. With PCRE you can also try the alternative regex '^(.*? ){3}2 '.
  • --exclude-dir \* (note that the * is quoted) excludes subdirectories, so that you can use the command even without the above assumption.

If you want the output to be sorted by filenames (as you would get when iterating *.paintedHaploDiversity), run sort -t ' ' -k 1,1 -o allOutputs_3.5{,} afterwards.

You might as well set export LC_ALL=C to speed up grep, sort, and maybe even sed.
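
Putting the optional pieces together, the combined commands might look like this (a sketch assuming your grep supports -m1 and --exclude-dir):

cd dir_with_all_the_files
export LC_ALL=C
grep -rE -m1 --exclude-dir \* '^([^ ]+ +){3}2 ' . |
sed -En 's/^\.\/(.*)\.paintedHaploDiversity:/\1 /p' > ../allOutputs_3.5
sort -t ' ' -k 1,1 -o ../allOutputs_3.5{,}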

Upvotes: 3

Peter - Reinstate Monica

Reputation: 16026

The problem, apart from the obvious I/O load of processing a few GB of data, is that starting one or several processes 81 million times takes a long time. Even building a command line or expanding a file glob to, say, 300 MB (for f in *...) may take substantial time or exceed system and program limits.

One solution is to write a C program that opens the files and processes them, or pipes their contents to other programs. But that may take a couple of days to program and debug, and maybe your intern is on vacation. There are, however, already programs in the Unix toolbox which do part of what you need, except that the file names are lost. We assume that all files are in a directory called bla.

Use tar to create a stream with the contents of the files, like this:

tar cf - bla | tar -xOf -

This writes the concatenated contents of the files to standard output, by default the console. Both tar processes (and, later, the grep) are started only once. The first tar finds all the files in the directory and creates an archive (which is a kind of structured concatenation) and writes it to stdout; the second tar grabs that archive, extracts the files and writes their contents to stdout instead of creating files in the file system, thanks to -O.

After that, start processing:

tar cf - bla | tar -xOf - | grep '^whatever is before the 2 \<2\>' > out.txt

If the presence of the filenames is a hard requirement, you may be able to repeat the processing chain but let the second tar emit the file names (-t option) and pipe that into a shell script which reads one line from out.txt and one line from the tar output, combines the two, and writes the combined lines to a new file.
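
A rough sketch of that pairing step, assuming every file contributes exactly one matching line and that both tar invocations walk the directory in the same order (neither is guaranteed):

# second pass: list the archived names, strip the prefix and extension,
# and glue each name to the corresponding line of out.txt
tar cf - bla | tar -tf - | grep -v '/$' |
sed -e 's|^bla/||' -e 's/\.paintedHaploDiversity$//' |
paste -d ' ' - out.txt > allOutputs_3.5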

Upvotes: 0

Thomas

Reputation: 181825

Difficult problem. Might have painted yourself into a corner there...

If even the find command takes too long, which does nothing but open, read and close every file, then the likely bottleneck is the seek time of an HDD. This is typically around 10 ms (source), so for 81 million files you're looking at almost 10 days (81 million × 10 ms ≈ 810,000 s), assuming a single seek per file. Due to the filesystem (directory accesses etc.) it might take more seeks, but if locality is good each seek might also be shorter.

If you can afford to wait this long once, I'd recommend zipping up all those files into a single file. This will take a lot of time, but after that you can process the data set more quickly.
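
A minimal sketch of that one-time step (allfiles.tar is a placeholder name, and note that the filter shown afterwards loses the filenames):

# one slow, seek-bound pass: bundle the whole directory into a single archive
tar cf allfiles.tar bla

# every later pass can stream the archive sequentially instead of touching
# 81 million individual files
tar -xOf allfiles.tar | awk '$4==2' > matches_without_filenames.txt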

If zipping (or otherwise copying or accessing) each individual file is out of the question, a solution might be to take an image (snapshot) of the entire filesystem and copy that onto a faster drive. SSDs have seek times around 0.1 ms (source) so working off of an SSD you could be finished in slightly over two hours.

A more hardcore approach would be to write code that operates directly on the raw disk bytes, implementing the necessary parts of the filesystem and using large in-memory buffers to avoid disk seeks. Depending on how the files are scattered across the disk, this might give you a big speedup, but of course it's a nontrivial effort to program this.

Upvotes: 1
