Reputation: 33
I have a 2 million line CSV file in which I want to replace the second column of each line with a value unique to that string; the column is filled with usernames. The long process I've got below does work, but takes a while.
The values don't have to be hashed, but hashing seemed like a sure way that when the next file comes along there are no discrepancies (the same username always maps to the same value).
I'm by no means a coder, and was wondering if there was any way I could optimize the process, although I understand the best way to do this would be in some sort of scripting language.
#!/bin/bash
# Enter filename to read
echo "Enter File Name"
read filename
# Extracts usernames (column 2) from the file
awk -F "\"*,\"*" '{print $2}' "$filename" > usernames.txt
# Hashes each username with SHA-256 (spawns one openssl process per line)
while read -r line; do echo -n "$line" | openssl sha256 | sed 's/^.* //'; done < usernames.txt > hashedusernames.txt
# Deletes the username column from the original file
cut -d, -f2 --complement "$filename" > output.txt
# Pastes the hashed usernames onto the end of each line
paste -d , output.txt hashedusernames.txt > output2.txt
# Moves everything back into place
awk -F "\"*,\"*" '{print $1","$4","$2","$3}' output2.txt > final.csv
Example file (there are 7 columns in all, but only 3 are shown):
Time,Username,Size
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123
Upvotes: 0
Views: 143
Reputation: 3279
Yet another Python solution, focused on speed but also on maintainability.
#!/usr/bin/python3
import argparse
import hashlib
import re

parser = argparse.ArgumentParser(description='CSV swapper')
parser.add_argument(
    '-f',
    '--file',
    dest='file_path',
    type=str,
    required=True,
    help='The CSV file path.')


def hash_user(users, user):
    # Try the cache first; only digest a username the first time it is seen.
    try:
        return users[user]
    except KeyError:
        id_ = int(hashlib.md5(user.encode('utf-8')).hexdigest(), 16)
        users[user] = id_
        return id_


def main():
    args = parser.parse_args()
    username_extractor = re.compile(r',([\s\S]*?),')
    users = {}
    counter = 0
    templ = ',{},'
    with open(args.file_path) as file:
        with open('output.csv', 'w') as output:
            line = file.readline()
            while line:
                try:
                    counter += 1
                    if counter == 1:
                        continue  # skip the header row
                    username = username_extractor.search(line).groups()[0]
                    hashuser = hash_user(users, username)
                    # count=1 so only the first match (the second column)
                    # is replaced, not every other comma-delimited pair.
                    output.write(username_extractor.sub(
                        templ.format(hashuser), line, count=1)
                    )
                except Exception:
                    print('Malformed line at {}'.format(counter))
                finally:
                    line = file.readline()


if __name__ == '__main__':
    main()
There are still some points that could be optimized, but the central ones are: try first and handle the failure rather than check before acting, and cache the users' hashes so that repeated users don't have to be re-digested. Also, will you run this on a multi-core host? This could easily be improved using threads.
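As a side note, the caching idea could also be expressed with functools.lru_cache from the standard library; this is just a minimal sketch of the same memoization pattern, not a drop-in replacement for the program above:

from functools import lru_cache
import hashlib

@lru_cache(maxsize=None)
def hash_user(user):
    # Each distinct username is digested exactly once;
    # repeats are served straight from the cache.
    return int(hashlib.md5(user.encode('utf-8')).hexdigest(), 16)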
Upvotes: 1
Reputation: 3042
Since you used awk in your original attempt, here's a simpler approach in awk:
awk -F"," 'BEGIN{i=0;}
{if (unique_names[$2] == "") {
unique_names[$2]="Unique"i;
i++;
}
$2=unique_names[$2];
print $0}'
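To run it, pass the input file and redirect the result (input.csv and final.csv are just illustrative names):
$ awk -F"," 'BEGIN{OFS=","; i=0} {if (unique_names[$2] == "") {unique_names[$2] = "Unique" i; i++} $2 = unique_names[$2]; print $0}' input.csv > final.csv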
Upvotes: 0
Reputation: 54381
You could do this in Perl easily in a few lines. The following program uses the Crypt::Digest::SHA256 module, which you need to install from CPAN or from your OS repository if they have it.
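For example, one common way to install it from CPAN (assuming the stock cpan client is available):
$ cpan Crypt::Digest::SHA256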
The program assumes input from the DATA section, which we typically do around here to include example data in an mcve.
use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';

while (my $line = <DATA>) {
    # no need to chomp because we don't touch the last line
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print join ',', @fields;
}

__DATA__
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123
It prints the following output.
2017-01-01T14:53.45,g8EPHWc3L1ln_lfRhq8elyOUgsiJm6BtTtb_GVt945s,12345
2016-01-01T13:42.56,jwXsws2dJq9h_R08zgSIPhufQHr8Au8_RmniTQbEKY4,54312
2015-01-01T12:34.34,mkrKXbM1ZiPiXSSnWYNo13CUyzMF5cdP2SxHGyO7rgQ,54123
To make it read a file that is supplied as a command line argument and write to a new file with the .new extension, you can use it like this:
use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';

open my $fh_in, '<', $ARGV[0] or die $!;
open my $fh_out, '>', "$ARGV[0].new" or die $!;

while (my $line = <$fh_in>) {
    # no need to chomp because we don't touch the last line
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print $fh_out join ',', @fields;
}
Run it as follows:
$ perl foo.pl example.csv
Your new file will be named example.csv.new.
Upvotes: 1
Reputation: 168886
This Python program might do what you want. You can pass the filenames to convert on the command line:
$ python this_program.py file1.csv file2.csv
import fileinput
import csv
import sys
import hashlib

class stdout:
    # fileinput rebinds sys.stdout for each file when inplace=True,
    # so this proxy looks it up lazily on every write.
    def write(self, *args):
        sys.stdout.write(*args)

# Rewrite each file in place, keeping a .bak copy of the original.
input = fileinput.input(inplace=True, backup=".bak", mode='rb')
reader = csv.reader(input)
writer = csv.writer(stdout())
for row in reader:
    # Replace the username column with its SHA-256 hex digest.
    row[1] = hashlib.sha256(row[1]).hexdigest()
    writer.writerow(row)
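Note that this is Python 2 style code (it hashes the str from the CSV reader directly). A minimal sketch of the same approach under Python 3, assuming the usernames are UTF-8, would read in text mode and encode before hashing:

import fileinput
import csv
import sys
import hashlib

class stdout:
    def write(self, *args):
        sys.stdout.write(*args)

input = fileinput.input(inplace=True, backup=".bak")  # text mode in Python 3
reader = csv.reader(input)
writer = csv.writer(stdout())
for row in reader:
    row[1] = hashlib.sha256(row[1].encode('utf-8')).hexdigest()
    writer.writerow(row)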
Upvotes: 0