vngrd1
vngrd1

Reputation: 33

Find part of a string in CSV and replace whole cell with new entry?

I've got a CSV file with a column which I want to sift through. I want to use a pattern file to find all entries where the pattern exists even in part of the column's value, and replace the whole cell value with this "pattern". I made a list of keywords that I want to use as my "pattern" bank; So, if a cell in this column (this case only second) has this "pattern" as part of its string, then I want to replace the whole cell with this "pattern".

so for example:

my target file:

id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis & Private Hire,moreinfo2
id3,Tax Services,moreinfo3
id4,Tools & Hardware,moreinfo4
id5,Tool Sharpening,moreinfo5
id6,Tool Shops,moreinfo6
id7,Video Conferencing,moreinfo7
id8,Video & DVD Shops,moreinfo8
id9,Woodworking Equipment & Supplies,moreinfo9

my "pattern" file:

Taxidermy Equipment & Supplies
Taxis
Tax Services
Tool
Video
Wood

output file:

id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9

I came up with the usual "find and replace" sed:

sed -i 's/PATTERN/REPLACE/g' file.csv

but I want it to run on a specific column, so I came up with:

awk 'BEGIN{OFS=FS="|"}$2==PATTERN{$2=REPLACE}{print}' file.csv

but it doesn't work on "part of string" ([Video]:"Video & DVD Shops" -> "Video") and I can't seem to get it how awk takes input as a file for the "Pattern" block.

Is there an awk script for this? Or do I have to write something (in python with the built in csv suit for example?)

Upvotes: 3

Views: 711

Answers (4)

James Brown
James Brown

Reputation: 37404

In awk, using index. It only prints record if a replacement is made but it's easy to modify to printing even if there is no match (for example replace the print $1,i,$3} with $0=$1 OFS i OFS $3} 1):

$ awk -F, -v OFS=, '
NR==FNR { a[$1]; next }          # store "patterns" to a arr
        { for(i in a)            # go thru whole a for each record
              if(index($2,i))    # if "pattern" matches $2
                  print $1,i,$3  # print with replacement
        }
' pattern_file target_file
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9

Upvotes: 2

Walter A
Walter A

Reputation: 20002

When you want to solve this with sed, you will need some steps.
For each pattern you will need a command like

sed 's/^\([^,]*\),\(.*Tool.*\),/\1,Tool,/' inputfile

You will need each pattern twice, you can translate the patternfile with

sed 's/.*/"&" "&"/' patternfile
# Change the / into #, thats easier for the final command
sed 's#.*#"&" "&"#' patternfile

When you instruct sed to read a commandfile, you do need to start each line with sed. The commandfile will look like

sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g'  patternfile

You can store this is a file and use the file, but with process substitution you can do things like

cat <(echo "Now this line from echo is handled as a file")

Nice. Lets test the solution

sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#'  patternfile) inputfile

Almost there! Only the first output line is strange. Whats happening?
The first pattern has a &, and that has a special meaning.
We can patch our command by adding a backslash in the pattern:

sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile) inputfile

Upvotes: 1

tinkertime
tinkertime

Reputation: 3042

Here's a (mostly) awk solution:

#/bin/bash

patterns_regex=`cat patterns_file  | tr '\n' '|'`
cat target_file | awk -F"," -v patterns="$patterns_regex" '
BEGIN {
    OFS=",";
    split(patterns, patterns_split, "|");
}

{
    for (pattern_num in patterns_split) {
        pattern=patterns_split[pattern_num];
        if (pattern != "" && $2 ~ pattern) {
            print $1,pattern,$3
        }
    }
}'

Upvotes: 1

choroba
choroba

Reputation: 241898

Perl solution, using Text::CSV_XS:

#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV_XS qw{ csv };

my ($input_file, $pattern_file) = @ARGV;

open my $pfh, '<', $pattern_file or die $!;
chomp( my @patterns = <$pfh> );

my $aoa = csv(in => $input_file);
for my $line (@$aoa) {
    for my $pattern (@patterns) {
        if (-1 != index $line->[1], $pattern) {
            $line->[1] = $pattern;
            last
        }
    }
}

csv(in => $aoa, quote_space => 0, eol => "\n", out => \*STDOUT);

Upvotes: 1

Related Questions