AKW

Reputation: 907

Bash combining two text files based on regex match

Haven't seen a solution similar enough to this yet...

I have two files, each containing a list of file names. The contents overlap, but file A contains some file names that are not in file B. Also, the file extensions are different in files A and B. That is:

A                     B
------------          --------------
file-1-2.txt          file-1-2.png
file-2-3.txt          file-3-4.png
file-3-4.txt
...

How do I combine the two files into one, comma-delimited, ignoring lines that don't have a match?

That is:

C
------------
file-1-2.txt,file-1-2.png
file-3-4.txt,file-3-4.png

I believe some usage of awk similar to the following will work:

awk 'FNR==NR{NOT SURE} {print $1,$2}' fileA fileB

Thanks in advance!

Upvotes: 1

Views: 1207

Answers (4)

Grisha Levit

Reputation: 8617

This pure bash solution should work and handle dots, backslashes, dashes, and other special characters in either file.

mapfile -t arr_a < A
mapfile -t arr_b < B

for a in "${arr_a[@]}"; do for b in "${arr_b[@]}"; do
    [[ ${a%.*} == "${b%.*}" ]] && printf '%s,%s\n' "$a" "$b" && break
done; done

First, we read the contents of the files into arrays, one line per item, using mapfile [1]. Then, for each line in A, we compare it to each line in B.

To compare only the portion before the extension, we use the shell parameter expansion ${var%pattern}, which removes the shortest match of the glob .* [2] from the end of the filenames.

[1] The -t option strips the trailing newline from the array items.

[2] The . here is a literal period, so the expansion removes the last period and everything after it.
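
As a small illustration of the expansion described above, here is a sketch comparing the shortest-match form ${var%pattern} with the longest-match form ${var%%pattern}. The sample name is made up for the demonstration and does not come from the question.

```shell
#!/usr/bin/env bash
# Illustrative only: the sample name below is an assumption, not from the question.
name='archive.tar.gz'

short="${name%.*}"     # % removes the shortest suffix matching .*  (the last extension only)
long="${name%%.*}"     # %% removes the longest suffix matching .*  (everything from the first dot)

printf '%s\n' "$short" "$long"
# prints:
# archive.tar
# archive
```

With single-extension names like file-1-2.txt both forms give the same result; they differ only when a name contains more than one dot.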

Upvotes: 2

dawg

Reputation: 103754

You could do:

$ awk 'function base(fn) {sub("[.][^.]*$", "", fn); return fn} 
       NR==FNR { fn[$1]; next} 
       {for (e in fn){ if (base(e)==base($1)){ printf "%s,%s\n", e, $1 }}} ' f1 f2
file-1-2.txt,file-1-2.png
file-3-4.txt,file-3-4.png

Because the names from the first file are kept in an awk associative array, which is unordered, and the outer loop runs over the lines of the second file, the output follows the order of the second file, not the first.


Explanation:

  1. function base(fn) {sub("[.][^.]*$", "", fn); return fn} strips the extension from a filename, where the extension is taken to be the last . and the non-. characters after it. If the name contains no ., it is returned unchanged.
  2. NR==FNR { fn[$1]; next} stores each line of the first file (each file name, in this case) as a key of an associative array. NR==FNR is an awk idiom that is true only while the first file is being read, and next skips the remaining rules, so only this block runs for the first file. $1 is used because awk's default field splitting drops leading and trailing whitespace; it also means a name containing a space is cut off at that space. Since Unix filenames can contain spaces, including leading or trailing ones, use $0 instead if you want each line kept exactly as it is.
  3. {for (e in fn){ if (base(e)==base($1)){ printf "%s,%s\n", e, $1 }}} runs for every line of the second file (NR==FNR is false there, and next has already skipped this block for the first file). It loops through the saved file names and prints a pair whenever the base names are equal.
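
A possible variation on the steps above, sketched here rather than taken from the answer: key the array on the base name instead of the full name, so each line of the second file needs one lookup instead of a loop over every saved name. The function name pair_by_base is hypothetical, and the sketch assumes no two names in the first file share a base name (a later one would overwrite an earlier one).

```shell
#!/usr/bin/env bash
# pair_by_base is a hypothetical helper name; it takes the two list files as arguments.
# Assumption: base names are unique within the first file.
pair_by_base() {
    awk '
        # Strip the last ".ext" from a name; names without a dot come back unchanged.
        function base(fn) { sub(/[.][^.]*$/, "", fn); return fn }

        # First file: remember each full line under its base name.
        NR == FNR { seen[base($0)] = $0; next }

        # Second file: print a pair when the base name was seen in the first file.
        { b = base($0) }
        b in seen { print seen[b] "," $0 }
    ' "$1" "$2"
}
```

Called as pair_by_base f1 f2, this still prints in the order of the second file, and it uses $0 so that spaces inside a name are preserved.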

Upvotes: 1

MasterCheffinator

Reputation: 343

Here's something fairly brute force:

file1="file1.txt"
file2="file2.txt"
out_file="out.txt"
touch "$out_file"
while IFS= read -r line; do  # read the first file line by line, verbatim
  file1_name="$(echo "$line" | cut -d'.' -f1)"    # text before the first dot (the name without its extension)
  file2_name="$(grep "$file1_name\." "$file2")"   # lines of file2 containing that name followed by a dot
  if [ -n "$file2_name" ]; then   # did we find a match?
    echo "$line,$file2_name" >> "$out_file"
  else
    echo "Did not find a match to ${line} in $file2"
  fi
done < "$file1"

We loop through file1 and look for matches in file2. If one is found, we append the pair to the output file.

Other improvements: a better grep using regexp:

file2_name="$(grep -e "^$file1_name\.[^.]*$" "$file2")"

This looks for a line that starts with $file1_name, followed by a dot . and then no more dots until the end of the line, which is the extension.

Upvotes: 0

Mike Wodarczyk

Reputation: 1273

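The Unix join command should do what you want. Set the field separator to a dot with -t '.' and join on the first field of both files. join expects both inputs to be sorted on the join field, so you may need to sort the files ahead of time. In bash the sort can be done on the same command line as the join, using process substitution: <(sort file1.txt) <(sort file2.txt). Note that join prints the join field once, followed by the remaining fields of each line, so the raw output looks like file-1-2.txt.png and needs a further step to become file-1-2.txt,file-1-2.png.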
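
One way the join approach might be completed is sketched below. It is not the answer author's exact command: the function name join_by_base is hypothetical, the awk step that rebuilds the two names is an addition, and the sketch assumes every name contains exactly one dot. It also relies on bash process substitution.

```shell
#!/usr/bin/env bash
# join_by_base is a hypothetical helper name; it takes the two list files as arguments.
# Assumption: each name has exactly one dot, i.e. base.ext.
join_by_base() {
    # Sort both inputs on the text before the dot, then join on that field.
    # LC_ALL=C keeps sort and join in agreement about ordering.
    # join emits base.ext1.ext2, so awk rebuilds the two full names around a comma.
    LC_ALL=C join -t '.' \
        <(LC_ALL=C sort -t '.' -k1,1 "$1") \
        <(LC_ALL=C sort -t '.' -k1,1 "$2") |
    awk -F '.' '{ print $1 "." $2 "," $1 "." $3 }'
}
```

Called as join_by_base file1.txt file2.txt, the output is ordered by the sorted base names rather than by the original order of either file.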

Upvotes: 0
