Reputation: 907
Haven't seen a solution similar enough to this yet...
I have two files, each containing a list of file names. The contents overlap, but file A contains some file names that are not in file B. The file extensions also differ between A and B. That is:
A             B
------------  ------------
file-1-2.txt  file-1-2.png
file-2-3.txt  file-3-4.png
file-3-4.txt
...
How do I combine the two files, comma-delimited, into one ignoring lines that don't match?
That is:
C
------------
file-1-2.txt,file-1-2.png
file-3-4.txt,file-3-4.png
I believe some usage of awk similar to the following will work:
awk 'FNR==NR{NOT SURE} {print $1,$2}' fileA fileB
Thanks in advance!
Upvotes: 1
Views: 1207
Reputation: 8617
This pure bash solution should work and handle dots, backslashes, dashes, and other special characters in either file.
mapfile -t arr_a < A
mapfile -t arr_b < B
for a in "${arr_a[@]}"; do for b in "${arr_b[@]}"; do
[[ ${a%.*} == "${b%.*}" ]] && printf '%s,%s\n' "$a" "$b" && break
done; done
First, we read the contents of the files into arrays, one line per item, using mapfile.[1] Then, for each line in A, we compare to each line in B.
To compare only the portion before the extension, we use the shell parameter expansion ${var%pattern}, which removes the shortest match of the glob .*[2] from the end of the filenames.
[1] The -t option strips the trailing newline from the array items.
[2] The . here is literal, so the expansion removes the last period and everything after it.
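To make the expansion concrete, here is a small sketch of what ${var%.*} does to a few names. The sample names and the strip_ext helper are made up for illustration and are not part of the answer above.

```shell
#!/bin/sh
# strip_ext is a hypothetical helper used only for this demo.
strip_ext() {
  # ${1%.*} removes the shortest suffix matching the glob ".*",
  # i.e. the last dot and everything after it.
  printf '%s\n' "${1%.*}"
}

strip_ext "file-1-2.txt"    # prints: file-1-2
strip_ext "archive.tar.gz"  # prints: archive.tar
strip_ext "README"          # prints: README (no dot, nothing removed)
```

Because only the last dot counts, names with several dots keep everything before the final extension.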
Upvotes: 2
Reputation: 103754
You could do:
$ awk 'function base(fn) {sub("[.][^.]*$", "", fn); return fn}
NR==FNR { fn[$1]; next}
{for (e in fn){ if (base(e)==base($1)){ printf "%s,%s\n", e, $1 }}} ' f1 f2
file-1-2.txt,file-1-2.png
file-3-4.txt,file-3-4.png
Since awk associative arrays are unordered, the order of the printout is determined by the order of the second file, not the first.
Explanation:
function base(fn) {sub("[.][^.]*$", "", fn); return fn}
is a function that strips the extension from the filename (assuming that the extension is the run of non-. characters to the right of the last . found; the entire name is returned if no . is found).
NR==FNR { fn[$1]; next}
reads each line (each file name in this case) into an associative array. NR==FNR is an awk idiom that is true only for the first file, and next means that only this part is executed on the first file of file names. $1 is used so that leading and trailing spaces are stripped (it is also only the first whitespace-separated field, so a name containing spaces would be cut short). Since Unix filenames can have leading or trailing spaces, this is a rare ambiguity you need to resolve. If you don't want the lines stripped, use $0 instead.
{for (e in fn){ if (base(e)==base($1)){ printf "%s,%s\n", e, $1 }}}
Now, for any line that is not from the first file (where NR==FNR is false; lines of the first file never get here because next skipped this part), loop through the saved file names and print if the base name is the same.
Upvotes: 1
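If the lists are long, looping over every saved name for each line gets slow. A possible variant, offered as a sketch and not as the answer's own method, keys the array by base name so each line of the second file needs only one lookup. The join_by_base wrapper and the temp-file setup are assumptions for the demo. It uses $0, so whole lines are compared, and if two names in the first file share a base name only the last one is kept.

```shell
#!/bin/sh
# join_by_base is a hypothetical wrapper; $1 and $2 are the two list files.
join_by_base() {
  awk '
    # Strip the last dot and what follows; return the name unchanged if no dot.
    function base(fn) { sub(/[.][^.]*$/, "", fn); return fn }
    # First file: remember each full name under its base name.
    NR == FNR { seen[base($0)] = $0; next }
    # Second file: one lookup per line, print the pair on a hit.
    (base($0) in seen) { printf "%s,%s\n", seen[base($0)], $0 }
  ' "$1" "$2"
}

# Demo with the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' file-1-2.txt file-2-3.txt file-3-4.txt > "$tmp/f1"
printf '%s\n' file-1-2.png file-3-4.png > "$tmp/f2"
join_by_base "$tmp/f1" "$tmp/f2"
# prints:
# file-1-2.txt,file-1-2.png
# file-3-4.txt,file-3-4.png
rm -rf "$tmp"
```

As with the looped version, the output follows the order of the second file.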
Reputation: 343
Here's something fairly brute force:
file1="file1.txt"
file2="file2.txt"
out_file="out.txt"
touch "$out_file"
while IFS= read -r line; do # read the first file line by line, keeping backslashes and spaces intact
  file1_name="$(echo "$line" | cut -d'.' -f1)" # get the filename without extension (text before the first dot)
  file2_name="$(grep "$file1_name\." "$file2")" # look for that name followed by a dot in the second file
  if [ -n "$file2_name" ]; then # did we find a match?
    echo "$line,$file2_name" >> "$out_file"
  else
    echo "Did not find a match to ${line} in $file2"
  fi
done < "$file1"
We loop through file1 and look for matches in file 2. If found, we output to the output file.
Other improvements: a stricter grep pattern:
file2_name="$(grep -e "^$file1_name\.[^.]*$" "$file2")"
This looks for a line that starts with $file1_name, followed by a dot . and then no more dots until the end of the line, which is the extension.
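To see what anchoring changes, here is a small sketch with made-up names. match_loose and match_strict are hypothetical helpers wrapping the two kinds of pattern; the strict one uses ^ and $ so the name must start the line and be followed by a single extension.

```shell
#!/bin/sh
# Both helpers read candidate names on stdin; $1 is the base name to look for.
match_loose()  { grep -e "$1\."; }           # name followed by a dot, anywhere in the line
match_strict() { grep -e "^$1\.[^.]*$"; }    # whole line is name + one extension

names='file-1-2.png
my-file-1-2.png
file-1-2.tar.gz'

printf '%s\n' "$names" | match_loose file-1-2    # prints all three lines
printf '%s\n' "$names" | match_strict file-1-2   # prints: file-1-2.png
```

Note that the base name is still interpreted as a regular expression in both cases, so names containing characters such as [ or * would need escaping.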
Upvotes: 0
Reputation: 1273
The Unix join command should do what you want. Set the field separator to a dot with -t '.' and join on the first column of both files (the default). join expects both inputs to be sorted on the join field, so you may need to sort the files ahead of time. In bash, the sort can be done on the same command line as the join using process substitution:
join -t '.' <(sort -t '.' -k 1,1 file1.txt) <(sort -t '.' -k 1,1 file2.txt)
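A sketch of how this might look end to end, assuming every name contains exactly one dot. With -t '.', join prints the shared base name followed by the two extensions (for example file-1-2.txt.png), so a final awk step rebuilds the two full names. join_names is a hypothetical helper, and the sorted copies go to temp files instead of process substitution so that it also runs in plain sh.

```shell
#!/bin/sh
# join_names is a hypothetical helper; $1 and $2 are the two list files.
# Assumption: each name has exactly one dot (base.ext).
join_names() {
  a_sorted=$(mktemp)
  b_sorted=$(mktemp)
  # Sort both inputs on the base name, using the same collation join will use.
  LC_ALL=C sort -t. -k1,1 "$1" > "$a_sorted"
  LC_ALL=C sort -t. -k1,1 "$2" > "$b_sorted"
  # join emits "base.extA.extB"; awk turns that into "base.extA,base.extB".
  LC_ALL=C join -t. "$a_sorted" "$b_sorted" |
    awk -F. '{ printf "%s.%s,%s.%s\n", $1, $2, $1, $3 }'
  rm -f "$a_sorted" "$b_sorted"
}

# Demo with the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' file-1-2.txt file-2-3.txt file-3-4.txt > "$tmp/A"
printf '%s\n' file-1-2.png file-3-4.png > "$tmp/B"
join_names "$tmp/A" "$tmp/B"
# prints:
# file-1-2.txt,file-1-2.png
# file-3-4.txt,file-3-4.png
rm -rf "$tmp"
```

Unlike the loop-based answers, the output here comes out in sorted order, not in the order of either input file.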
Upvotes: 0