Reputation: 23
I am looking for another approach to apply RIPEMD-160 to the second column of a CSV file.
Here is my code:
awk -F "," -v env_var="$key" '{
tmp="echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
if ( (tmp | getline cksum) > 0 ) {
$3 = toupper(cksum)
}
close(tmp)
print
}' /test/source.csv > /ziel.csv
I run it on a big CSV file (1 GB); after two days I have only 100 MB of output, which means I would need to wait about a month to get the whole new CSV.
Can you suggest another idea or approach to process my data faster?
Thanks in advance
Upvotes: 1
Views: 170
Reputation: 33685
Your solution hits Cygwin where it hurts the most: spawning new programs. Cygwin is terribly slow at this.
You can make this faster by using all the cores in your computer, but it will still be very slow.
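As a rough illustration of the per-line cost (a sketch, not a benchmark; it assumes python3 and openssl are on your PATH, and the loop count is arbitrary), you can time how long each process spawn takes:

#!/usr/bin/python3
# spawn_cost.py (hypothetical name) - measure the cost of spawning
# one openssl process per input line, the bottleneck described above
import subprocess
import time

N = 100
t0 = time.time()
for _ in range(N):
    # one fork+exec of openssl per iteration, as the awk loop does per line
    subprocess.run(['openssl', 'ripemd160'],
                   input=b'test', stdout=subprocess.DEVNULL)
print('%.1f ms per spawn' % ((time.time() - t0) / N * 1000))

Multiply that per-spawn time by the number of lines in your file and the two-day runtime stops being surprising.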
You need a program that does not start other programs to compute the RIPEMD-160 sum. Here is a small Python script that takes the CSV on standard input and outputs the CSV on standard output with the second column replaced by the RIPEMD-160 sum.
riper.py:
#!/usr/bin/python
import hashlib
import fileinput
import os

key = os.environ['key']
for line in fileinput.input():
    # Naive CSV reader - split on ,
    col = line.rstrip().split(",")
    # Compute RIPEMD-160 on column 2 plus the key
    h = hashlib.new('ripemd160')
    h.update(col[1] + key)
    # Replace column 2 with the uppercased hex digest
    col[1] = h.hexdigest().upper()
    print ','.join(col)
Now you can run:
cat source.csv | key=a python riper.py > ziel.csv
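The script above is Python 2. If your system only has Python 3, here is a minimal sketch of the same idea (riper3.py is a hypothetical name; note that hashlib.new('ripemd160') depends on the underlying OpenSSL build and can raise ValueError, e.g. on OpenSSL 3, where RIPEMD-160 was moved to the legacy provider):

#!/usr/bin/python3
# riper3.py (hypothetical) - Python 3 version of riper.py
import hashlib
import fileinput
import os

key = os.environ['key']
for line in fileinput.input():
    # Naive CSV reader - split on ,
    col = line.rstrip('\n').split(',')
    # Compute RIPEMD-160 on column 2 plus the key; can raise ValueError
    # if the OpenSSL build does not provide ripemd160
    h = hashlib.new('ripemd160')
    h.update((col[1] + key).encode('utf-8'))
    # Replace column 2 with the uppercased hex digest
    col[1] = h.hexdigest().upper()
    print(','.join(col))

It is run the same way: cat source.csv | key=a python3 riper3.py > ziel.csv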
This will still only use a single core of your system. To use all cores, GNU Parallel can help. If you do not have GNU Parallel 20161222 or newer in your package system, it can be installed as:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
You will need Perl installed to run GNU Parallel:
key=a
export key
parallel --pipe-part --block -1 -a source.csv -k python riper.py > ziel.csv
This will chop source.csv on the fly into one block per CPU core and run the Python script on each block. On my 8-core machine this processes a 1 GB file with 139482000 lines in 300 seconds.
If you need it faster still, you will need to convert riper.py to a compiled language (e.g. C).
Upvotes: 0
Reputation: 10039
# prepare a batch (to avoid forking from awk)
awk -F "," -v env_var="$key" '
    BEGIN {
        print "if [ -r /tmp/MD160.Result ];then rm /tmp/MD160.Result;fi"
    }
    {
        print "echo \"\$( echo -n \047" $2 env_var "\047 | openssl ripemd160 )\" >> /tmp/MD160.Result"
    } ' /test/source.csv > /tmp/MD160.eval

# eval the MD160 for each line in one batch (should be faster); see the worked example below
. /tmp/MD160.eval
# take result and adapt for output
awk '
    # load the MD160 results (second field of each openssl output line), keyed by line number
    FNR == NR { m[NR] = toupper($2); next }
    # switch FS to "," and force the first source line to be re-split
    FNR == 1 { FS = ","; $0 = $0 "" }
    # adapt original line: put the checksum in column 3
    { $3 = m[FNR]; print }
' /tmp/MD160.Result /test/source.csv > /ziel.csv
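To make the mechanics concrete: for a hypothetical row whose second column is foo (with key=bar), the first awk writes this line into /tmp/MD160.eval:

echo "$( echo -n 'foobar' | openssl ripemd160 )" >> /tmp/MD160.Result

Sourcing the batch then appends the openssl output, a line of the form (stdin)= <40 hex digits>, to /tmp/MD160.Result, and the second field of that line is what the final awk picks up as the checksum.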
Upvotes: 0
Reputation: 3127
You can use GNU Parallel to increase the speed of the output by executing the awk command in parallel. For an explanation, check here. (If the order of the output lines matters, you may want to add parallel's -k option to keep the blocks in input order.)
cat /test/source.csv | parallel --pipe awk -F "," -v env_var="$key" '{
    tmp = "echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
    if ( (tmp | getline cksum) > 0 ) {
        $3 = toupper(cksum)
    }
    close(tmp)
    print
}' > /ziel.csv
Upvotes: 1