Reputation: 23
I am looking for another approach to apply RIPEMD-160 to the second column of a CSV file.
Here is my code:
awk -F "," -v env_var="$key" '{
tmp="echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
if ( (tmp | getline cksum) > 0 ) {
$3 = toupper(cksum)
}
close(tmp)
print
}' /test/source.csv > /ziel.csv
I run it on a big CSV file (1 GB); after two days I have only 100 MB of output, which means I would need to wait about a month to get the whole new CSV.
Can you suggest another idea or approach to process my data faster?
Thanks in advance
Upvotes: 1
Views: 170
Reputation: 33685
Your solution hits Cygwin where it hurts the most: spawning new programs. Cygwin is terribly slow at this.
You can make this faster by using all the cores in your computer, but it will still be very slow.
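As a rough illustration of the per-line cost (a sketch, not a benchmark; it assumes python3 and openssl are on your PATH, and the loop count is arbitrary), you can time how long each process spawn takes:

#!/usr/bin/python3
# spawn_cost.py (hypothetical name) - measure the cost of spawning
# one openssl process per input line, the bottleneck described above
import subprocess
import time

N = 100
t0 = time.time()
for _ in range(N):
    # one fork+exec of openssl per iteration, as the awk loop does per line
    subprocess.run(['openssl', 'ripemd160'],
                   input=b'test', stdout=subprocess.DEVNULL)
print('%.1f ms per spawn' % ((time.time() - t0) / N * 1000))

Multiply that per-spawn time by the number of lines in your file and the two-day runtime stops being surprising.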
You need a program that does not start other programs to compute the RIPEMD-160 sum. Here is a small Python script that takes the CSV on standard input and outputs the CSV on standard output with the second column replaced by the RIPEMD-160 sum.
riper.py:
#!/usr/bin/python
import hashlib
import fileinput
import os

key = os.environ['key']
for line in fileinput.input():
    # Naive CSV reader - split on ,
    col = line.rstrip().split(",")
    # Compute RIPEMD-160 on column 2 plus the key
    h = hashlib.new('ripemd160')
    h.update(col[1] + key)
    # Replace column 2 with the uppercased hex digest
    col[1] = h.hexdigest().upper()
    print ','.join(col)
Now you can run:
cat source.csv | key=a python riper.py > ziel.csv
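The script above is Python 2. If your system only has Python 3, here is a minimal sketch of the same idea (riper3.py is a hypothetical name; note that hashlib.new('ripemd160') depends on the underlying OpenSSL build and can raise ValueError, e.g. on OpenSSL 3, where RIPEMD-160 was moved to the legacy provider):

#!/usr/bin/python3
# riper3.py (hypothetical) - Python 3 version of riper.py
import hashlib
import fileinput
import os

key = os.environ['key']
for line in fileinput.input():
    # Naive CSV reader - split on ,
    col = line.rstrip('\n').split(',')
    # Compute RIPEMD-160 on column 2 plus the key; can raise ValueError
    # if the OpenSSL build does not provide ripemd160
    h = hashlib.new('ripemd160')
    h.update((col[1] + key).encode('utf-8'))
    # Replace column 2 with the uppercased hex digest
    col[1] = h.hexdigest().upper()
    print(','.join(col))

It is run the same way: cat source.csv | key=a python3 riper3.py > ziel.csv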
This will still only use a single core of your system. To use all cores, GNU Parallel can help. If you do not have GNU Parallel 20161222 or newer in your package system, it can be installed as:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
You will need Perl installed to run GNU Parallel:
key=a
export key
parallel --pipe-part --block -1 -a source.csv -k python riper.py > ziel.csv
This will chop source.csv on the fly into one block per CPU core and run the Python script on each block. On my 8-core machine this processes a 1 GB file with 139482000 lines in 300 seconds.
If you need it faster still, you will need to convert riper.py to a compiled language (e.g. C).
Upvotes: 0
Reputation: 10039
# prepare a batch (to avoid forking from awk)
awk -F "," -v env_var="$key" '
    BEGIN {
        print "if [ -r /tmp/MD160.Result ];then rm /tmp/MD160.Result;fi"
    }
    {
        print "echo \"\$( echo -n \047" $2 env_var "\047 | openssl ripemd160 )\" >> /tmp/MD160.Result"
    } ' /test/source.csv > /tmp/MD160.eval

# eval the MD160 for each line in one batch (should be faster); see the worked example below
. /tmp/MD160.eval
# take result and adapt for output
awk '
    # load the MD160 results (second field of each openssl output line), keyed by line number
    FNR == NR { m[NR] = toupper($2); next }
    # switch FS to "," and force the first source line to be re-split
    FNR == 1 { FS = ","; $0 = $0 "" }
    # adapt original line: put the checksum in column 3
    { $3 = m[FNR]; print }
' /tmp/MD160.Result /test/source.csv > /ziel.csv
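To make the mechanics concrete: for a hypothetical row whose second column is foo (with key=bar), the first awk writes this line into /tmp/MD160.eval:

echo "$( echo -n 'foobar' | openssl ripemd160 )" >> /tmp/MD160.Result

Sourcing the batch then appends the openssl output, a line of the form (stdin)= <40 hex digits>, to /tmp/MD160.Result, and the second field of that line is what the final awk picks up as the checksum.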
Upvotes: 0
Reputation: 3127
You can use GNU Parallel to increase the speed of the output by executing the awk command in parallel. For an explanation, check here. (If the order of the output lines matters, you may want to add parallel's -k option to keep the blocks in input order.)
cat /test/source.csv | parallel --pipe awk -F "," -v env_var="$key" '{
    tmp = "echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
    if ( (tmp | getline cksum) > 0 ) {
        $3 = toupper(cksum)
    }
    close(tmp)
    print
}' > /ziel.csv
Upvotes: 1