kadakyo
kadakyo

Reputation: 21

How Should I modify a Big File with Specific Column and Value Using awk?

Here I have a dataset that includes header and separator, like this:

a|b|c|d|e|f|g
1|2|3|4|5|6|5
2|4|2|3|5|2|1

And another config file with some columns and values, like this:

b:5
d:6

My aim is to modify the dataset with the config file. The result is like this:

a|b|c|d|e|f|g    
1|5|3|6|5|6|5
2|5|2|6|5|2|1

Without using "for" outside awk, how can I complete the process?

Upvotes: 0

Views: 137

Answers (2)

Tom Fenech
Tom Fenech

Reputation: 74615

Here's how you could do it using awk:

awk '
     NR == FNR { rep[$1] = $2; next } 
     FNR == 1 { for (i = 1; i <= NF; ++i) if ($i in rep) cols[i] = rep[$i] }
     FNR > 1 { for (i in cols) $i = cols[i] }
     1
' FS=':' replacements FS='|' OFS='|' dataset
  • first, save all of the key:value replacements into an array rep
    • target the first file using the standard NR == FNR (total line number equals this file's line number)
    • skip the rest of the script with next
  • for the first line of the dataset (second file), work out which columns contain the headers, saving them to cols along with their replacements
  • for the rest of the lines of the dataset, substitute the columns with their replacement values
  • print all lines of the second file using the condition 1 (always true), which triggers the default action { print }

Note that since the two files have different separators, these are specified as arguments after the awk script. FS defines the input Field Separator and OFS defines the Output Field Separator for the next filename argument. The arguments should be read like:

# read the file 'replacements' with input field separator set to ':'
FS=':' replacements
# read the file 'dataset' with input and output field separator set to '|'
FS='|' OFS='|' dataset

Testing it out

$ cat replacements 
b:5
d:6
$ cat dataset 
a|b|c|d|e|f|g
1|2|3|4|5|6|5
2|4|2|3|5|2|1
$ awk '
>      NR == FNR { rep[$1] = $2; next } 
>      FNR == 1 { for (i = 1; i <= NF; ++i) if ($i in rep) cols[i] = rep[$i] }
>      FNR > 1 { for (i in cols) $i = cols[i] }
>      1
> ' FS=':' replacements FS='|' OFS='|' dataset
a|b|c|d|e|f|g
1|5|3|6|5|6|5
2|5|2|6|5|2|1

Upvotes: 4

bipll
bipll

Reputation: 11940

It is probably wisest to do in the following order. First, you parse the config (assuming GNU dialect of awk):

gawk -F \| -v OFS=\| 'NR == FNR { # this pattern trigs inside the first file
    split($0, mapping, /:/)
    rules[mapping[1]] = mapping[2]
    next # short-circuit to skip other blocks
}

Next, on the first line of the data file you need to parse column headers:

FNR == 1 {
    for(i = 1; i <= NF; ++i) if($i in rules) forcedValues[i] = rules[$i]
    print
    next
}

Now you have an array forcedValues that, for some column numbers from 1 through 7 (in your example), contains values those columns should be reset to. So now you process the rest of the file:

{
    for(i in forcedValues) $i = forcedValues[i]
    print
}' config.txt input.txt > output.txt

(The three code snippets in this post are actually parts of a single shell command and should be concatenated via newlines.)

Upvotes: 1

Related Questions