digitvldvl

Reputation: 15

Bash script to return every unique, first occurrence of elements from an array

Given array1, I want to find the first occurrence of each unique csv entry. The array is already ordered by date, so the first occurrence will be the most recent.

array1=(url://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ url://root/sub1/sub2/2022-07-22/ a.csv/)

I want to return an array containing the most recent occurrence of each unique csv entry (with the full paths):

array2=(url://root/sub1/sub2/2022-10-22/a.csv/ url://root/sub1/sub2/2022-10-22/b.csv/ url://root/sub1/sub2/2022-08-22/c.csv/ url://root/sub1/sub2/2022-08-22/d.csv/)

and an array of all the duplicate entries (with the full paths):

array3=(url://root/sub1/sub2/2022-09-22/a.csv/ url://root/sub1/sub2/2022-09-22/b.csv/ url://root/sub1/sub2/2022-08-22/a.csv/ url://root/sub1/sub2/2022-08-22/b.csv/ url://root/sub1/sub2/2022-07-22/a.csv/)

My thought process is as follows: loop through the array; when an element is a url path, write the url path and the csv files that follow it to a new array, stopping when the next element is another url path. If a later url path contains the same csv files, write them to a duplicate array; if it contains new csv files, append them to the new array.
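A minimal sketch of that single-pass idea (the variable names and the *://* test for spotting a url path are my assumptions; it needs bash 4+ for associative arrays):

#!/bin/bash

declare -A seen                            # csv names encountered so far
array2=(); array3=()                       # unique entries; duplicate entries
current=""                                 # url path of the current group

for elem in "${array1[@]}"; do
    if [[ $elem == *://* ]]; then          # a url path starts a new group
        current=$elem
    elif [[ -n ${seen[$elem]} ]]; then     # csv name already seen under an earlier path
        array3+=( "${current}${elem}" )    # duplicate entry, with full path
    else
        seen[$elem]=1                      # first (most recent) occurrence
        array2+=( "${current}${elem}" )    # unique entry, with full path
    fi
done

Since the array is ordered newest first, the first time a csv name appears it is automatically the most recent one.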

Upvotes: 1

Views: 106

Answers (1)

tshiono

Reputation: 22012

Would you please try the following:

#!/bin/bash

declare -A seen                                                         # check if the csv element has appeared

array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)
array2=(); array3=()

while read -r first others; do                                          # split the line into "s3:.." and others
    read -r -a ary <<< "$others"                                        # split others into list of csv's
    dup=(); new=()                                                      # temporary arrays
    for i in "${ary[@]}"; do                                            # loop over the csv's
        (( seen[$i]++ )) && dup+=( "$i" ) || new+=( "$i" )              # classify the csv by whether it has been seen
    done

    for i in "${new[@]}"; do                                            # loop over the array of unique entries
        array2+=( "${first}${i}" )                                      # append the full path to array2
    done
    for i in "${dup[@]}"; do                                            # loop over the array of duplicate entries
        array3+=( "${first}${i}" )                                      # append the full path to array3
    done
done < <(sed -E 's# (s3://)#'\\$'\n''\1#g' <<< "${array1[*]}")          # construct 2-d structure from array1

echo "${array2[@]}"
echo "${array3[@]}"

Output:

s3://root/sub1/sub2/2022-10-22/a.csv/ s3://root/sub1/sub2/2022-10-22/b.csv/ s3://root/sub1/sub2/2022-08-22/c.csv/ s3://root/sub1/sub2/2022-08-22/d.csv/
s3://root/sub1/sub2/2022-09-22/a.csv/ s3://root/sub1/sub2/2022-09-22/b.csv/ s3://root/sub1/sub2/2022-08-22/a.csv/ s3://root/sub1/sub2/2022-08-22/b.csv/ s3://root/sub1/sub2/2022-07-22/a.csv/

As array1 looks like it has a 2-D structure, I've first rearranged the elements with sed into:

s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/
s3://root/sub1/sub2/2022-07-22/ a.csv/

then processed them line by line.
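The classification in the loop leans on the exit status of the arithmetic command: seen[$i]++ evaluates to the old count, so (( seen[$i]++ )) fails on the first occurrence of a name and succeeds on every later one. A tiny standalone check of that idiom:

declare -A seen
for x in a a b a; do
    (( seen[$x]++ )) && echo "$x: duplicate" || echo "$x: first occurrence"
done

which prints "a: first occurrence", "a: duplicate", "b: first occurrence", "a: duplicate".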

Upvotes: 1
