Pokecallum

Reputation: 33

Bash script CSV manipulation optimization

I have a 2-million-line CSV file. What I want to do is replace the second column of each line with a value unique to that string; the column is filled with usernames. The long process I've got below does work, but it takes a while.

It doesn't have to be hashed, but hashing seemed like a sure way to avoid discrepancies when the next file comes along.

I'm by no means a coder, and I was wondering if there was any way I could optimize the process, although I understand the best way to do this would be in some sort of scripting language.

#!/bin/bash
#Enter filename to read
echo "Enter File Name"
read -r filename
#Extracts usernames from file
awk -F "\"*,\"*" '{print $2}' "$filename" > usernames.txt
#Hashes usernames using SHA256
while read -r line; do echo -n "$line" | openssl sha256 | sed 's/^.* //'; done < usernames.txt > hashedusernames.txt
#Deletes usernames out of first file
cut -d, -f2 --complement "$filename" > output.txt
#Pastes hashed usernames to end of first file
paste -d , output.txt hashedusernames.txt > output2.txt
#Moves everything back into place
awk -F "\"*,\"*" '{print $1","$4","$2","$3}' output2.txt > final.csv

Example file (there are 7 columns in all, but only 3 are shown):

Time Username Size
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123

Upvotes: 0

Views: 143

Answers (4)

Rafael Aguilar

Reputation: 3279

Yet another Python solution, focused on speed but also on maintainability.

#!/usr/bin/python3

import argparse
import hashlib
import re

parser = argparse.ArgumentParser(description='CSV swapper')
parser.add_argument(
    '-f',
    '--file',
    dest='file_path',
    type=str,
    required=True,
    help='The CSV file path.')

def hash_user(users, user):
    try:
        return users[user]
    except KeyError:
        id_ = int(hashlib.md5(user.encode('utf-8')).hexdigest(), 16)
        users[user] = id_
        return id_

def main():
    args = parser.parse_args()
    username_extractor = re.compile(r',([\s\S]*?),')
    users = {}
    counter = 0
    templ = ',{},'
    with open(args.file_path) as file:
        with open('output.csv', 'w') as output:
            line = file.readline()
            while line:
                try:
                    counter += 1
                    if counter == 1:
                        # skip the header row (finally still advances line)
                        continue
                    username = username_extractor.search(line).groups()[0]
                    hashuser = hash_user(users, username)
                    # count=1 so only the second column (the first match)
                    # is replaced, not every other comma-delimited field
                    output.write(username_extractor.sub(
                        templ.format(hashuser), line, count=1)
                    )
                except StopIteration:
                    break
                except Exception:
                    print('Malformed line at {}'.format(counter))
                finally:
                    line = file.readline()

if __name__ == '__main__':
    main()

There are still some points that could be optimized, but the central ones are: trying first instead of checking first (EAFP), and caching users' hashes so repeated usernames don't have to be re-digested.

Also, will you run this on a multi-core host? If so, this can easily be improved by splitting the work across cores.
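A minimal sketch of that multi-core idea, using a multiprocessing.Pool rather than threads (CPython threads would serialize on the GIL for CPU-bound hashing); the file names and chunk size here are placeholder assumptions:

#!/usr/bin/python3
import hashlib
from multiprocessing import Pool

def hash_line(line):
    # Digest the second column and rebuild the line.
    fields = line.split(',')
    fields[1] = hashlib.md5(fields[1].encode('utf-8')).hexdigest()
    return ','.join(fields)

if __name__ == '__main__':
    # input.csv/output.csv are placeholder names.
    with open('input.csv') as src, open('output.csv', 'w') as dst:
        with Pool() as pool:
            # A large chunksize keeps inter-process overhead low on a
            # 2-million-line file; imap preserves the original line order.
            for line in pool.imap(hash_line, src, chunksize=10000):
                dst.write(line)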

Upvotes: 1

tinkertime

Reputation: 3042

Since you used awk in your original attempt, here's a simpler approach in awk:

awk -F"," 'BEGIN{i=0;}
           {if (unique_names[$2] == "") {
                unique_names[$2]="Unique"i;
                i++;
            }
            $2=unique_names[$2];
            print $0}'   
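Note that assigning to $2 makes awk rebuild the record with the output field separator, which is why OFS is set to a comma as well. Run on the three sample rows from the question, this maps each distinct username to an incrementing id:

2017-01-01T14:53.45,Unique0,12345
2016-01-01T13:42.56,Unique1,54312
2015-01-01T12:34.34,Unique2,54123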

Upvotes: 0

simbabque

Reputation: 54381

You could do this in Perl easily in a few lines. The following program uses the Crypt::Digest::SHA256 module, which you need to install from CPAN or from your OS repository if they have it.
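For example (the Debian package name here is an assumption on my part; the module ships in the CryptX distribution):

$ cpanm Crypt::Digest::SHA256       # from CPAN
$ sudo apt install libcryptx-perl   # Debian/Ubuntu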

The program assumes input from the DATA section, which we typically do around here to include example data in an mcve.

use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';

while (my $line = <DATA>) {
    # no need to chomp because we don't touch the last line
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print join ',', @fields;
}

__DATA__
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123

It prints the following output.

2017-01-01T14:53.45,g8EPHWc3L1ln_lfRhq8elyOUgsiJm6BtTtb_GVt945s,12345
2016-01-01T13:42.56,jwXsws2dJq9h_R08zgSIPhufQHr8Au8_RmniTQbEKY4,54312
2015-01-01T12:34.34,mkrKXbM1ZiPiXSSnWYNo13CUyzMF5cdP2SxHGyO7rgQ,54123

To make it read a file that is supplied as a command line argument and write to a new file with the .new extension, you can use it like this:

use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';

open my $fh_in, '<', $ARGV[0] or die $!;
open my $fh_out, '>', "$ARGV[0].new" or die $!;

while (my $line = <$fh_in>) {
    # no need to chomp because we don't touch the last line
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print $fh_out join ',', @fields;
}

Run it as follows:

$ perl foo.pl example.csv

Your new file will be named example.csv.new.

Upvotes: 1

Robᵩ

Reputation: 168886

This Python program might do what you want. You can pass the filenames to convert on the command line:

$ python this_program.py file1.csv file2.csv


import fileinput
import csv
import sys
import hashlib


class stdout:
    # Delegates to sys.stdout at call time: fileinput's inplace mode
    # swaps sys.stdout to point at the rewritten file, so the csv
    # writer ends up writing into the new version of each input file.
    def write(self, *args):
        sys.stdout.write(*args)

# Rewrite each file in place, keeping a .bak backup of the original.
input = fileinput.input(inplace=True, backup=".bak", mode='rb')
reader = csv.reader(input)
writer = csv.writer(stdout())

for row in reader:
    # Binary mode yields byte strings (Python 2), which is what
    # hashlib.sha256() expects here.
    row[1] = hashlib.sha256(row[1]).hexdigest()
    writer.writerow(row)
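Note that this relies on Python 2 semantics: the csv module reading a binary-mode file, and hashlib accepting the resulting byte strings. A rough Python 3 equivalent, assuming one input file on the command line and writing to a separate output.csv instead of editing in place, might look like this:

import csv
import hashlib
import sys

# Sketch only: writes hashed rows to output.csv rather than editing in place.
with open(sys.argv[1], newline='') as src, \
     open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # hashlib needs bytes in Python 3, so encode the username first
        row[1] = hashlib.sha256(row[1].encode('utf-8')).hexdigest()
        writer.writerow(row)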

Upvotes: 0
