Reputation: 25
I have a scenario
where i want to hash some columns of csv file
how to do that with below data
ID|NAME|CITY|AGE
1|AB1|BBC|12
2|AB2|FGD|17
3|AB3|ASD|18
4|AB4|SDF|19
5|AB5|ASC|22
The Column name NAME | AGE should get hashed with random values
like below output
ID|NAME|CITY|AGE
1|68b329da9111314099c7d8ad5cb9c940|BBC|77bAD9da9893er34099c7d8ad5cb9c940
2|69b32fga9893e34099c7d8ad5cb9c940|FGD|68bAD9da989yue34099c7d8ad5cb9c940
3|46b329da9893e3403453d8ad5cb9c940|ASD|60bfgD9da9893e34099c7d8ad5cb9c940
4|50Cd29da9893e34099c7d8ad5cb9c940|SDF|67bAD9da98973e34099c7d8ad5cb9c940
5|67bAD9da9893e34099c7d8ad5cb9c940|ASC|67bAD9da11893e34099c7d8ad5cb9c940
When i tested this code below code gives me same value for the column 'NAME' it should give randomized values
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample.csv
output :
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
Upvotes: 1
Views: 1287
Reputation: 5395
You might consider using a language that has support for md5 included, or at least cache the md5 results (I assume that the city and age have a limited domain, which is smaller than the number of lines).
Perl has support for md5 out of the box:
perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) {
$F[$_] = md5_hex($F[$_]) for (1,3);
print join "|",@F
} else { print }'
online demo: https://ideone.com/xg6cxZ (to my surprise ideone has perl available in bash)
Digest::MD5
is a core module, any perl installation should have it-M'Digest::MD5 qw(md5_hex)'
- this loads the md5_hex function-l
handle line endings-F'\|'
- autosplit fields on | (this implies -a and -n)2..eof
- range operator (or flip-flop as some want to call it) - true between line 2 and end of the file$F[$_] = md5_hex($F[$_])
- replace field $_ with it's md5 sumfor (1,3)
- statement modifier runs the statement for 1 and 3 aliasing $_ to themprint join "|",@F
- print the modified fields else { print }
- this hanldes the headerNote about speed: on my machine this processes ~100,000 lines in about 100 ms, compared with an awk variant of this answer that does 5000 lines in ~1 minute 14 seconds (i wasn't patient enough to wait for 100,000 lines)
time perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { $F[$_] = md5_hex($F[$_]) for (1,3);print join "|",@F } else { print }' <sample2.txt > out4.txt
real 0m0.121s
user 0m0.118s
sys 0m0.003s
$ time awk -F'|' -v OFS='|' -i md5.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >out2.txt
real 1m14.205s
user 0m50.405s
sys 0m35.340s
md5.awk defines the md5 function as such:
$ cat md5.awk
function md5(str, cmd, l, hex) {
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,0,32)
close(cmd)
return hex
}
/bin/echo
because there are some variants of shell where echo doesn't have -n-n
mostly because I want to be able to compare the results with the perl resultssubstr(l,0,32)
- on my machine openssl md5
doesn't return just the sum, it has also the file name - see: https://ideone.com/KGMWPe - substr gets only the relevant partAs I was saying in the beginning, if you really want to use awk, at least cache the result of the openssl
tool.
$ cat md5memo.awk
function md5(str, cmd, l, hex) {
if (cache[str])
return cache[str]
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,0,32)
close(cmd)
cache[str] = hex
return hex
}
With the above caching, the results improve dramatically:
$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >outmemo.txt
real 0m0.192s
user 0m0.141s
sys 0m0.085s
[savuso@localhost hash]$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <sample2.txt >outmemof.txt
real 0m0.281s
user 0m0.222s
sys 0m0.088s
however your mileage my vary: sample2.txt has 100000 lines, with 5 different values for $2 and 40 different values for $4. Real life data may vary!
Note: I just realized that my awk implementation doesn't handle headers, but you can get that from the other answers
Upvotes: 0
Reputation: 784868
You may use it like this:
awk 'function hash(s, cmd, hex, line) {
cmd = "openssl md5 <<< \"" s "\""
if ( (cmd | getline line) > 0)
hex = line
close(cmd)
return hex
}
BEGIN {
FS = OFS = "|"
}
NR == 1 {
print
next
}
{
print $1, hash($2), $3, hash($4)
}' file
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|2737b49252e2a4c0fe4c342e92b13285
2|157aa4a48373eaf0415ea4229b3d4421|FGD|4d095eeac8ed659b1ce69dcef32ed0dc
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|cf4278314ef8e4b996e1b798d8eb92cf
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|3bb50ff8eeb7ad116724b56a820139fa
5|427872b1ac3a22dc154688ddc2050516|ASC|2fc57d6f63a9ee7e2f21a26fa522e3b6
Upvotes: 2
Reputation: 52336
Example using GNU datamash to do the hashing and some awk to rearrange the columns it outputs:
$ datamash -t'|' --header-in -f md5 2,4 < input.txt | awk 'BEGIN { FS=OFS="|"; print "ID|NAME|CITY|AGE" } { print $1, $5, $3, $6 }'
ID|NAME|CITY|AGE
1|1109867462b2f0f0470df8386036243c|BBC|c20ad4d76fe97759aa27a0c99bff6710
2|14da3a611e2f8953d76b6fb7866b01d1|FGD|70efdf2ec9b086079795c442636b55fb
3|710a24b9eac0692b1adaabd07726211a|ASD|6f4922f45568161a8cdf4ad2299f6d23
4|c4d15b255ef3c6a89d1fe2e6a26b8eda|SDF|1f0e3dad99908345f7439f8ffabdffc4
5|96b24a28173a75cc3c682e25d3a6bd49|ASC|b6d767d2f8ed5d21a44b0e5886680cb9
Note that the MD5 hashes are different in this answer than (At the time of writing) the ones in the others; that's because they use approaches that add a trailing newline to the strings being hashed, producing incorrect results if you want the exact hash:
$ echo AB1 | md5sum
d44aec35a11ff6fa8a800120dbef1cd7 -
$ echo -n AB1 | md5sum
1109867462b2f0f0470df8386036243c -
Upvotes: 1
Reputation: 9845
You have to specify |
as input and output field separators. Otherwise $2
is not what you expect, but an empty string.
awk -F '|' -v "OFS=|" 'FNR==1 { print; next } {
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' sample.csv
prints
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|12
2|157aa4a48373eaf0415ea4229b3d4421|FGD|17
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|18
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|19
5|427872b1ac3a22dc154688ddc2050516|ASC|22
Upvotes: 1