Reputation: 1112
I have a big CSV file containing 60210 lines. Those lines contain hashes, paths and file names, like so:
hash | path | number | hash-2 | name
459asde2c6a221f6... | folder/..| 6 | 1a484efd6.. | file.txt
777abeef659a481f... | folder/..| 1 | 00ab89e6f.. | anotherfile.txt
....
I am filtering this file against a list of hashes, and to facilitate the filtering process, I create and use a reduced version of this file, like so:
hash | path
459asde2c6a221f6... | folder/..
777abeef659a481f... | folder/..
The filtered result contains all the lines whose hash is not present in my reference hash base.
But to make a correct analysis of the filtered result, I need the previous data that I removed. So my idea was to read the filtered result file, search for the hash field, and write it into an enhanced result file that will contain all the data.
I use a loop to do so:
getRealNames() {
    # Read the whole original file into memory once
    originalcontent="$( cat $originalfile)"
    while IFS='' read -r line; do
        # The hash is the first space-separated field of the filtered line
        hash=$( echo "$line" | cut -f 1 -d " " )
        # Look the hash up in the original data
        originalline=$( echo "$originalcontent" | grep "$hash" )
        if [ ! -z "$originalline" ]; then
            # Append the full original line to the enhanced result
            echo "$originalline" >> "$resultenhanced"
        fi
    done < "$resultfile"
}
But in real usage it is highly inefficient: for the file above, this loop takes approximately 3 hours to run on a system with 4 GB of RAM and an Intel Centrino 2 processor, which seems to me way too long for this kind of operation.
Is there any way I can improve this operation?
Upvotes: 1
Views: 122
Reputation: 896
Your explanation of what you are trying to do is unclear because it describes two tasks: filtering data and then adding missing values back to the filtered data. Your sample script addresses the second, so I'll assume that is what you are trying to solve here.
As I read it, you have a filtered result that contains hashes and paths, and you need to look up those hashes in the original file to get the other field values. Rather than loading the original file into memory, just let grep process the file directly. Assuming a single space (as indicated by cut -d " ") is your field separator, you can extract the hash in your read command, too.
while IFS=' ' read -r hash data; do
    grep "$hash" "$originalfile" >> "$resultenhanced"
done < "$resultfile"
Upvotes: -2
Reputation: 85560
Given the nature of your question, it is hard to understand why you would prefer using the shell to process such a huge file when specialized tools like awk or sed exist to process it, as Stéphane Chazelas points out in a wonderful answer on Unix.SE.
Your problem becomes easy to solve once you use awk/perl, which speed up the text processing. Also, you are reading the whole file into RAM by doing originalcontent="$( cat $originalfile)", which is not desirable at all.
Assuming that in both the original and the reference file the hash starts at the first column and the columns are separated by |, you need to use awk as
awk -v FS="|" 'FNR==NR{ uniqueHash[$1]; next }!($1 in uniqueHash)' ref_file orig_file
The above takes into memory only the first-column entries from your reference file; the original file is not loaded into memory at all. Once we have the entries from $1 (the first column) of the reference file, we filter the original file by selecting those lines whose first column is not in the array (uniqueHash) we created.
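As an illustration, here is a tiny, hypothetical run (shortened, made-up hashes; ref_file and orig_file are just the names used above). Note that with FS="|" any space before the | stays part of $1, so the separator layout must be identical in both files:
# ref_file (reference hashes):
#   459asde2c6a221f6 | folder/a
# orig_file (full data):
#   459asde2c6a221f6 | folder/a | 6 | 1a484efd6 | file.txt
#   777abeef659a481f | folder/b | 1 | 00ab89e6f | anotherfile.txt
awk -v FS="|" 'FNR==NR{ uniqueHash[$1]; next }!($1 in uniqueHash)' ref_file orig_file
# prints only the line whose hash is absent from ref_file:
# 777abeef659a481f | folder/b | 1 | 00ab89e6f | anotherfile.txt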
Change your locale settings to make it even faster by setting the C locale, as in LC_ALL=C awk ...
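Spelled out against the command above (same assumed file names), that would be:
LC_ALL=C awk -v FS="|" 'FNR==NR{ uniqueHash[$1]; next }!($1 in uniqueHash)' ref_file orig_file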
Upvotes: 4