Reputation: 25

How to hash particular column in csv file | linux |

I have a scenario

where i want to hash some columns of csv file

how to do that with below data

ID|NAME|CITY|AGE
1|AB1|BBC|12
2|AB2|FGD|17
3|AB3|ASD|18
4|AB4|SDF|19
5|AB5|ASC|22

The Column name NAME | AGE should get hashed with random values

like below output

ID|NAME|CITY|AGE
1|68b329da9111314099c7d8ad5cb9c940|BBC|77bAD9da9893er34099c7d8ad5cb9c940
2|69b32fga9893e34099c7d8ad5cb9c940|FGD|68bAD9da989yue34099c7d8ad5cb9c940
3|46b329da9893e3403453d8ad5cb9c940|ASD|60bfgD9da9893e34099c7d8ad5cb9c940
4|50Cd29da9893e34099c7d8ad5cb9c940|SDF|67bAD9da98973e34099c7d8ad5cb9c940
5|67bAD9da9893e34099c7d8ad5cb9c940|ASC|67bAD9da11893e34099c7d8ad5cb9c940

When i tested this code below code gives me same value for the column 'NAME' it should give randomized values

awk '{
    tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample.csv

output :

 68b329da9893e34099c7d8ad5cb9c940
 68b329da9893e34099c7d8ad5cb9c940
 68b329da9893e34099c7d8ad5cb9c940
 68b329da9893e34099c7d8ad5cb9c940
 68b329da9893e34099c7d8ad5cb9c940
 68b329da9893e34099c7d8ad5cb9c940

Upvotes: 1

Answers (4)

Sorin

Reputation: 5395

You might consider using a language that has support for md5 included, or at least cache the md5 results (I assume that the city and age have a limited domain, which is smaller than the number of lines).

Perl has support for md5 out of the box:

perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { 
       $F[$_] = md5_hex($F[$_]) for (1,3);
       print join "|",@F 
    } else { print }'

online demo: https://ideone.com/xg6cxZ (to my surprise ideone has perl available in bash)

Digest::MD5 is a core module, any perl installation should have it
-M'Digest::MD5 qw(md5_hex)' - this loads the md5_hex function
-l handle line endings
-F'\|' - autosplit fields on | (this implies -a and -n)
2..eof - range operator (or flip-flop as some want to call it) - true between line 2 and end of the file
$F[$_] = md5_hex($F[$_]) - replace field $_ with it's md5 sum
for (1,3) - statement modifier runs the statement for 1 and 3 aliasing $_ to them
print join "|",@F - print the modified fields
else { print } - this hanldes the header

Note about speed: on my machine this processes ~100,000 lines in about 100 ms, compared with an awk variant of this answer that does 5000 lines in ~1 minute 14 seconds (i wasn't patient enough to wait for 100,000 lines)

time perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { $F[$_] = md5_hex($F[$_]) for (1,3);print join "|",@F } else { print }' <sample2.txt > out4.txt 

real    0m0.121s
user    0m0.118s
sys 0m0.003s
$ time awk -F'|' -v OFS='|' -i md5.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >out2.txt

real    1m14.205s
user    0m50.405s
sys 0m35.340s

md5.awk defines the md5 function as such:

$ cat md5.awk 
function md5(str, cmd, l, hex) {
    cmd= "/bin/echo -n "str" | openssl md5 -r"
    if ( ( cmd | getline l) > 0 ) 
        hex = substr(l,0,32)
    close(cmd)
    return hex
}

I'm using /bin/echo because there are some variants of shell where echo doesn't have -n
I'm using -n mostly because I want to be able to compare the results with the perl results
substr(l,0,32) - on my machine openssl md5 doesn't return just the sum, it has also the file name - see: https://ideone.com/KGMWPe - substr gets only the relevant part
I'm using a separate file because it seems much cleaner, and because I can switch between function implementations fairly easy

As I was saying in the beginning, if you really want to use awk, at least cache the result of the openssl tool.

$ cat md5memo.awk 
function md5(str, cmd, l, hex) {
    if (cache[str]) 
        return cache[str]
    cmd= "/bin/echo -n "str" | openssl md5 -r"
    if ( ( cmd | getline l) > 0 ) 
        hex = substr(l,0,32)
    close(cmd)
    cache[str] = hex
    return hex
}

With the above caching, the results improve dramatically:

$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >outmemo.txt

real    0m0.192s
user    0m0.141s
sys 0m0.085s
[savuso@localhost hash]$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <sample2.txt >outmemof.txt

real    0m0.281s
user    0m0.222s
sys 0m0.088s

however your mileage my vary: sample2.txt has 100000 lines, with 5 different values for $2 and 40 different values for $4. Real life data may vary!

Note: I just realized that my awk implementation doesn't handle headers, but you can get that from the other answers

Upvotes: 0

anubhava

Reputation: 784868

You may use it like this:

awk 'function hash(s, cmd, hex, line) {
   cmd = "openssl md5 <<< \"" s "\""
   if ( (cmd | getline line) > 0)
      hex = line
   close(cmd)
   return hex
}
BEGIN {
   FS = OFS = "|"
}
NR == 1 {
   print
   next
}
{
   print $1, hash($2), $3, hash($4)
}' file

ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|2737b49252e2a4c0fe4c342e92b13285
2|157aa4a48373eaf0415ea4229b3d4421|FGD|4d095eeac8ed659b1ce69dcef32ed0dc
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|cf4278314ef8e4b996e1b798d8eb92cf
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|3bb50ff8eeb7ad116724b56a820139fa
5|427872b1ac3a22dc154688ddc2050516|ASC|2fc57d6f63a9ee7e2f21a26fa522e3b6

Upvotes: 2

Shawn

Reputation: 52336

Example using GNU datamash to do the hashing and some awk to rearrange the columns it outputs:

$ datamash -t'|' --header-in -f md5 2,4 < input.txt | awk 'BEGIN { FS=OFS="|"; print "ID|NAME|CITY|AGE" } { print $1, $5, $3, $6 }'
ID|NAME|CITY|AGE
1|1109867462b2f0f0470df8386036243c|BBC|c20ad4d76fe97759aa27a0c99bff6710
2|14da3a611e2f8953d76b6fb7866b01d1|FGD|70efdf2ec9b086079795c442636b55fb
3|710a24b9eac0692b1adaabd07726211a|ASD|6f4922f45568161a8cdf4ad2299f6d23
4|c4d15b255ef3c6a89d1fe2e6a26b8eda|SDF|1f0e3dad99908345f7439f8ffabdffc4
5|96b24a28173a75cc3c682e25d3a6bd49|ASC|b6d767d2f8ed5d21a44b0e5886680cb9

Note that the MD5 hashes are different in this answer than (At the time of writing) the ones in the others; that's because they use approaches that add a trailing newline to the strings being hashed, producing incorrect results if you want the exact hash:

$ echo AB1 | md5sum
d44aec35a11ff6fa8a800120dbef1cd7  -
$ echo -n AB1  | md5sum
1109867462b2f0f0470df8386036243c  -

Upvotes: 1

Bodo

Reputation: 9845

You have to specify | as input and output field separators. Otherwise $2 is not what you expect, but an empty string.

awk -F '|' -v "OFS=|" 'FNR==1 { print; next } {
    tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' sample.csv

prints

ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|12
2|157aa4a48373eaf0415ea4229b3d4421|FGD|17
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|18
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|19
5|427872b1ac3a22dc154688ddc2050516|ASC|22

Upvotes: 1

How to hash particular column in csv file | linux |

Answers (4)

Related Questions