massisenergy

Reputation: 1820

AWK: Reading all lines & manipulating one file ENTIRELY based on each line of another file

I have two input files:

File1.txt:

Name    Latin-small    Roman        Latin-caps #header, not to be processed
F0,        a,              I,            A
F1,        b,              II,           B
F2,        c,              III,          C
F3,        d,              IV,           D

File2.txt:

Lorem ipsum
Roman here.
LCaps here.
LSmall here.
Lorem ipsum

What I am trying to do:

  1. Assign the values of R, LC and LS from each line of File1.txt (line 6 of script.sh).
  2. Generate folders named Fx, where x = 0, 1, 2, 3, ..., using File1.txt (line 7 of script.sh).
  3. Place an individual file named Fx.txt, generated from File2.txt, in each of those folders (line 7 of script.sh).
  4. After reading a single line of File1.txt, the script should read (line 7 of script.sh) and modify the whole of File2.txt, looking at the keys. <- This is where I cannot make it work: it reads only one line of File2.txt for each line of File1.txt.
  5. The contents of those files are basically copies of File2.txt, except that each occurrence of here is replaced according to the keys Roman ($3 of File1.txt), LCaps ($4 of File1.txt) and LSmall ($2 of File1.txt) for each Fx.txt in each directory, using the values assigned in the first step from File1.txt (lines 9-17 of script.sh).

How can I get the following output in the respective folders (e.g. the output file in folder F2), using AWK:

cat F0/F0.txt

 Lorem ipsum
 Roman               I.
 LCaps           A.
 LSmall         a.
 Lorem ipsum

or,

cat F3/F3.txt

 Lorem ipsum
 Roman               IV.
 LCaps           D.
 LSmall         d.
 Lorem ipsum

or,

cat F2/F2.txt

 Lorem ipsum
 Roman III.
 LCaps C.
 LSmall c.
 Lorem ipsum

More info: File1 has ~300 lines; for each line (except the header), one file is to be created in its own folder. File2 has ~200 lines. Each of the phrases Roman, LSmall or LCaps occurs at arbitrary lines of File2.txt, but never more than one per line. These are the keys for modifying values in `

Thanks in advance! This question is a part of a bigger workflow.

EDIT2: trial code

script.sh

awk 'BEGIN {FS=","}
 {
  if ($1 !~ "F")
    {}
  else if ($1 ~ "F")
    {LS = $2; R = $3; LC = $4;
    system("mkdir "$1); filename=$1"/"$1".txt";
    {(getline < "File2.txt");
      {
        if ($0 ~ "Roman")
          {gsub("here",R); print >> filename;}
        else if ($0 ~ "LSmall")
          {gsub("here",LS); print >> filename;}
        else if ($0 ~ "LCaps")
          {gsub("here",LC); print >> filename;}
        else
          {print >> filename;}
      }
    }
    }
  }
' File1.txt

I'm getting the folder and file structure I need (file Fx.txt in folder Fx, where x = 0, 1, 2, ...), but the contents of these files are:

cat F0/F0.txt

Lorem ipsum

cat F1/F1.txt

Roman               II.

cat F2/F2.txt

LCaps           C.

cat F3/F3.txt

LSmall         d.

The key is to make awk read the entire File2.txt for each line of File1.txt, apply the modifications, and place the new files in their respective folders.

Upvotes: 0

Views: 1184

Answers (1)

tripleee

Reputation: 189317

As you discovered, Awk can really only process one line at a time. But we can turn things around and read the input file into memory, then loop over its lines repeatedly as we read the other file.

Your example has a comma and a space between the items in file1.txt but I assumed this is not a hard requirement, and so this script expects tab-delimited input instead.

awk -F "\t" 'BEGIN { split(":LSmall:Roman:LCaps", k, /:/) }
    NR==FNR { a[NR] = $0; n=NR; next }  # first file: store every line
    FNR==1 { next }  # skip header
    {
        system("mkdir "$1)
        filename=$1"/"$1".txt"
        for(i=1; i<=n; i++) {
            line = a[i]
            for (j=2; j<=NF; ++j) {
                if (line ~ k[j]) {
                    gsub(/here/, $j, line)
                    break
                }
            }
            print line >>filename
        }
        close(filename)  # avoid running out of open files with many rows
    }' file2.txt file1.txt

The BEGIN block initializes an array with substitution key names k. To keep it in sync with the fields in file1.txt, the first item k[1] is empty (it doesn't specify a substitution key).

When NR==FNR we are reading the first input file. We simply collect its lines into the array a.
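To see the NR==FNR idiom in isolation, here is a minimal sketch. The two input files (first.txt and second.txt) and their contents are made up for this demonstration and are not part of the question.

```shell
# Two tiny hypothetical inputs, created only for this demonstration.
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/first.txt"
printf 'p\nq\nr\n' > "$tmp/second.txt"

# NR counts records across all inputs; FNR restarts at 1 for each file.
# The two are equal only while the first file is being read.
result=$(awk '
    NR == FNR { a[NR] = $0; n = NR; next }          # first file: store lines
    { print FNR ": " $0 " (stored lines: " n ")" }  # second file only
' "$tmp/first.txt" "$tmp/second.txt")
printf '%s\n' "$result"
rm -r "$tmp"
```

This prints "1: p (stored lines: 2)", "2: q (stored lines: 2)" and "3: r (stored lines: 2)". One caveat with the idiom: if the first file is empty, NR==FNR is also true while reading the second file, so the test no longer tells the files apart.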

When we fall through, we are reading the second file, which is the mapping with directory names and substitutions. For each input line, we loop over all the lines in a, perform any substitution specified by the fields of the current line, and finally print the result to the specified output file. As soon as one key matches, we consider that line done; you may want to change this so that multiple keys can trigger on the same line.

You'll notice how we pull the first field and loop over the subsequent fields, looking up their corresponding key in k by index.

Demo: https://ideone.com/syTv99
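If several keys may appear on one line, simply dropping the break is not enough, because gsub(/here/, ...) would replace every "here" with the value of the first matching key. One possible sketch ties each replacement to its key instead. It assumes (this is not stated in the question) that every key is followed by spaces or tabs and then "here", and it collapses that whitespace to a single space. The input files below are hypothetical test data in a temporary directory.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Hypothetical inputs: a tab-delimited mapping, and a template
# in which one line holds two keys.
printf 'Name\tLatin-small\tRoman\tLatin-caps\nF0\ta\tI\tA\n' > file1.txt
printf 'Lorem ipsum\nRoman here, LCaps here.\nLSmall here.\n' > file2.txt

awk -F '\t' 'BEGIN { split(":LSmall:Roman:LCaps", k, /:/) }
    NR == FNR { a[NR] = $0; n = NR; next }
    FNR == 1 { next }  # skip header
    {
        system("mkdir " $1)
        filename = $1 "/" $1 ".txt"
        for (i = 1; i <= n; i++) {
            line = a[i]
            # No break: try every key, and replace only the "here"
            # that directly follows that key.
            for (j = 2; j <= NF; ++j)
                gsub(k[j] "[ \t]+here", k[j] " " $j, line)
            print line >> filename
        }
        close(filename)
    }' file2.txt file1.txt
out=$(cat F0/F0.txt)
printf '%s\n' "$out"
```

With this test data, F0/F0.txt contains "Lorem ipsum", "Roman I, LCaps A." and "LSmall a.". Values containing & or backslashes would need escaping before being used as a gsub replacement.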

If you want to do this on hundreds of files, perhaps refactor some or all of the surrounding loop out into a shell script and concentrate on the substitution actions in the Awk script. The shell can easily loop over the data in file1.txt just as well, which will simplify the Awk script somewhat and make the overall process easier to understand.

# Trim the obnoxious header
tail -n +2 file1.txt |
while read -r directory LSmall Roman LCaps; do
    mkdir "$directory"
    awk -v LSmall="$LSmall" -v Roman="$Roman" -v LCaps="$LCaps" '
        BEGIN { split("LSmall:Roman:LCaps", k, /:/)
            split(LSmall ":" Roman ":" LCaps, r, /:/) }
        {
            for (j=1; j<=3; ++j)
                if ($0 ~ k[j]) {
                    gsub(/here/, r[j])
                    break
                }
        }1' file2.txt >"$directory"/"$directory".txt
done

Demo: https://ideone.com/RUhsUS
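If the comma-and-space separators from the original File1.txt do have to stay, a regular-expression field separator is one way to handle them without converting the file to tabs. The sketch below only shows the field splitting; the sample rows are copied from the question into a temporary file, and the "|" in the output is there purely to make the field boundaries visible.

```shell
tmp=$(mktemp -d)
# The header and first two data rows of File1.txt from the question.
printf '%s\n' \
    'Name    Latin-small    Roman        Latin-caps' \
    'F0,        a,              I,            A' \
    'F1,        b,              II,           B' > "$tmp/File1.txt"

# FS is a regex here: a comma followed by any run of spaces or tabs.
fields=$(awk -F ',[ \t]*' 'FNR > 1 { print $1 "|" $2 "|" $3 "|" $4 }' "$tmp/File1.txt")
printf '%s\n' "$fields"
rm -r "$tmp"
```

This prints "F0|a|I|A" and "F1|b|II|B", so $2, $3 and $4 arrive without the padding. The same -F value should drop into the first Awk script above in place of "\t"; trailing whitespace at the end of a line, if any, would still end up in the last field.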

Upvotes: 1
