Reputation: 1112
I have a big CSV file containing 60210 lines. Those lines contain hashes, paths and file names, like so:
hash | path | number | hash-2 | name
459asde2c6a221f6... | folder/..| 6 | 1a484efd6.. | file.txt
777abeef659a481f... | folder/..| 1 | 00ab89e6f.. | anotherfile.txt
....
I am filtering this file against a list of hashes, and to facilitate the filtering process, I create and use a reduced version of this file, like so:
hash | path
459asde2c6a221f6... | folder/..
777abeef659a481f... | folder/..
The filtered result contains all the lines whose hash is not present in my reference hash base.
But to make a correct analysis of the filtered result, I need the previous data that I removed. So my idea was to read the filtered result file, search for the hash field, and write it into an enhanced result file that will contain all the data.
I use a loop to do so:
getRealNames() {
    # Read the whole original file into memory once
    originalcontent="$( cat $originalfile)"
    while IFS='' read -r line; do
        # The hash is the first space-separated field of the filtered line
        hash=$( echo "$line" | cut -f 1 -d " " )
        # Look the hash up in the original data
        originalline=$( echo "$originalcontent" | grep "$hash" )
        if [ ! -z "$originalline" ]; then
            # Append the full original line to the enhanced result
            echo "$originalline" >> "$resultenhanced"
        fi
    done < "$resultfile"
}
But in real usage it is highly inefficient: for the file above, this loop takes approximately 3 hours to run on a system with 4 GB of RAM and an Intel Centrino 2 processor, which seems to me way too long for this kind of operation.
Is there any way I can improve this operation?
Upvotes: 1
Views: 122
Reputation: 896
Your explanation of what you are trying to do is unclear because it describes two tasks: filtering data and then adding missing values back to the filtered data. Your sample script addresses the second, so I'll assume that is what you are trying to solve here.
As I read it, you have a filtered result that contains hashes and paths, and you need to look up those hashes in the original file to get the other field values. Rather than loading the original file into memory, just let grep process the file directly. Assuming a single space (as indicated by cut -d " ") is your field separator, you can extract the hash in your read command, too.
while IFS=' ' read -r hash data; do
    grep "$hash" "$originalfile" >> "$resultenhanced"
done < "$resultfile"
Upvotes: -2
Reputation: 85560
Given the nature of your question, it is hard to understand why you would prefer using the shell to process such a huge file when specialized tools like awk or sed exist to process it, as Stéphane Chazelas points out in a wonderful answer on Unix.SE.
Your problem becomes easy to solve once you use awk/perl, which speed up the text processing. Also, you are reading the whole file into RAM by doing originalcontent="$( cat $originalfile)", which is not desirable at all.
Assuming that in both the original and the reference file the hash starts at the first column and the columns are separated by |, you need to use awk as
awk -v FS="|" 'FNR==NR{ uniqueHash[$1]; next }!($1 in uniqueHash)' ref_file orig_file
The above takes into memory only the first-column entries from your reference file; the original file is not loaded into memory at all. Once we have the entries from $1 (the first column) of the reference file, we filter the original file by selecting those lines whose first column is not in the array (uniqueHash) we created.
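As an illustration, here is a tiny, hypothetical run (shortened, made-up hashes; ref_file and orig_file are just the names used above). Note that with FS="|" any space before the | stays part of $1, so the separator layout must be identical in both files:
# ref_file (reference hashes):
#   459asde2c6a221f6 | folder/a
# orig_file (full data):
#   459asde2c6a221f6 | folder/a | 6 | 1a484efd6 | file.txt
#   777abeef659a481f | folder/b | 1 | 00ab89e6f | anotherfile.txt
awk -v FS="|" 'FNR==NR{ uniqueHash[$1]; next }!($1 in uniqueHash)' ref_file orig_file
# prints only the line whose hash is absent from ref_file:
# 777abeef659a481f | folder/b | 1 | 00ab89e6f | anotherfile.txt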
Change your locale settings to make it even faster by setting the C locale, as in LC_ALL=C awk ...
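Spelled out against the command above (same assumed file names), that would be:
LC_ALL=C awk -v FS="|" 'FNR==NR{ uniqueHash[$1]; next }!($1 in uniqueHash)' ref_file orig_file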
Upvotes: 4