I have a CSV with different columns and one column includes the email. The script should generate a string based on the columns and add it to the CSV as an additional column. At the moment the script runs through each line and forms the string. But what I want is, that the script only builds the string of rows with the same email address. The tricky thing is, I don't know the different email addresses, so I can't hardcoding this and group them dynamically. What could be an approach to build this? This is my initial CSV: 42342;home;2020-01-12;2020-01-13;test@test.com 45235;home;2020-04-12;2020-04-13;test@test.com 68787;photo;2020-05-12;2020-05-13;email@test.com 68787;test;2020-05-12;2020-05-13;moritz@test.com 68787;test;2020-05-12;2020-05-13;moritz@test.com 68787;test;2020-05-12;2020-05-13;moritz@test.com How the CSV should look like at the end: 42342;home;2020-01-12;2020-01-13;test@test.com;home 2020-01-12_2020-01-13 && home 2020-04-12_2020-04-13 45235;home;2020-04-12;2020-04-13;test@test.com;home 2020-01-12_2020-01-13 && home 2020-04-12_2020-04-13 68787;photo;2020-05-12;2020-05-13;email@test.com;photo 2020-05-12_2020-05-13 68787;test;2020-05-12;2020-05-13;moritz@test.com;test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 68787;test;2020-05-12;2020-05-13;moritz@test.com;test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 68787;test;2020-05-12;2020-05-13;moritz@test.com;test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 This is my bash script: getPhotosCommand(){ com="" header="ID;DIR;START_DATE;END_DATE" while read line; do IFS=';' read -r -a array <<< "$line" dir=${array[2]} start_date=${array[3]} end_date=${array[4]} newCom="$dir $start_date_$end_date && " com=$com$newCom done < $file_new_photos echo $com }

A two-pass awk solution that keeps the original order of the rows. The first pass is for building the 6th field (for each email), and the second pass is for appending them to the corresponding rows. awk ' BEGIN {FS = OFS = ";"} { if (NR == FNR) { str = $2 " " $3 " " $4 if (arr[$5]) { arr[$5] = arr[$5] " && " str } else { arr[$5] = str } } else { print $0, arr[$5] } } ' file.scsv file.scsv output: 42342;home;2020-01-12;2020-01-13;test@test.com;home 2020-01-12 2020-01-13 && home 2020-04-12 2020-04-13 45235;home;2020-04-12;2020-04-13;test@test.com;home 2020-01-12 2020-01-13 && home 2020-04-12 2020-04-13 68787;photo;2020-05-12;2020-05-13;email@test.com;photo 2020-05-12 2020-05-13 68787;test;2020-05-12;2020-05-13;moritz@test.com;test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13 68787;test;2020-05-12;2020-05-13;moritz@test.com;test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13 68787;test;2020-05-12;2020-05-13;moritz@test.com;test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13

bashshellcsvgrouping

ScienceNoob

Reputation: 231

How to group by email in CSV without knowing the email addresses?

I have a CSV with different columns and one column includes the email. The script should generate a string based on the columns and add it to the CSV as an additional column. At the moment the script runs through each line and forms the string. But what I want is, that the script only builds the string of rows with the same email address. The tricky thing is, I don't know the different email addresses, so I can't hardcoding this and group them dynamically.

What could be an approach to build this?

This is my initial CSV:

42342;home;2020-01-12;2020-01-13;[email protected]
45235;home;2020-04-12;2020-04-13;[email protected]
68787;photo;2020-05-12;2020-05-13;[email protected]
68787;test;2020-05-12;2020-05-13;[email protected]
68787;test;2020-05-12;2020-05-13;[email protected]
68787;test;2020-05-12;2020-05-13;[email protected]

How the CSV should look like at the end:

42342;home;2020-01-12;2020-01-13;[email protected];home 2020-01-12_2020-01-13 && home 2020-04-12_2020-04-13
45235;home;2020-04-12;2020-04-13;[email protected];home 2020-01-12_2020-01-13 && home 2020-04-12_2020-04-13
68787;photo;2020-05-12;2020-05-13;[email protected];photo 2020-05-12_2020-05-13
68787;test;2020-05-12;2020-05-13;[email protected];test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13
68787;test;2020-05-12;2020-05-13;[email protected];test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13
68787;test;2020-05-12;2020-05-13;[email protected];test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13 && test 2020-05-12_2020-05-13

This is my bash script:

getPhotosCommand(){
    com=""
    header="ID;DIR;START_DATE;END_DATE" 
    
    while read line; do
        IFS=';' read -r -a array <<< "$line"

        dir=${array[2]}
        start_date=${array[3]}
        end_date=${array[4]}

        newCom="$dir $start_date_$end_date && "
        com=$com$newCom
    
    done < $file_new_photos
    
    echo $com

}

Upvotes: 1

Answers (3)

Fravadona

Reputation: 17208

A two-pass awk solution that keeps the original order of the rows. The first pass is for building the 6th field (for each email), and the second pass is for appending them to the corresponding rows.

awk '
    BEGIN {FS = OFS = ";"}
    {
        if (NR == FNR) {
            str = $2 " " $3 " " $4
            if (arr[$5]) {
                arr[$5] = arr[$5] " && " str
            } else {
                arr[$5] = str
            }
        } else {
            print $0, arr[$5]
        }
    }
' file.scsv file.scsv

output:

42342;home;2020-01-12;2020-01-13;[email protected];home 2020-01-12 2020-01-13 && home 2020-04-12 2020-04-13
45235;home;2020-04-12;2020-04-13;[email protected];home 2020-01-12 2020-01-13 && home 2020-04-12 2020-04-13
68787;photo;2020-05-12;2020-05-13;[email protected];photo 2020-05-12 2020-05-13
68787;test;2020-05-12;2020-05-13;[email protected];test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13
68787;test;2020-05-12;2020-05-13;[email protected];test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13
68787;test;2020-05-12;2020-05-13;[email protected];test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13 && test 2020-05-12 2020-05-13

Upvotes: 2

M. Nejat Aydin

Reputation: 10133

Assuming the lines are grouped by email addresses and the file has no blank lines, this bash script may be what you're trying to do.

#!/bin/bash

cnt=0
while IFS= read -r line; do
    email=${line##*;}
    if [ "$email" = "$prev_email" ]; then
        grpline[cnt++]=$line
    else
        if ((cnt)); then
            tail=
            for ((i = 0; i < cnt; ++i)); do
                IFS=';' read -ra fld <<< "${grpline[i]}"
                tail+=" && ${fld[1]} ${fld[2]}_${fld[3]}"
            done
            tail=${tail:4}
            for ((i = 0; i < cnt; ++i)); do
                printf '%s;%s\n' "${grpline[i]}" "$tail"
            done
        fi
        prev_email=$email
        grpline[0]=$line
        cnt=1
    fi
done < <(cat file; echo)

If the file is not grouped by email addresses then replace the cat file with sort -t \; -k5 file.

Upvotes: 0

A. Ocannaille

Reputation: 348

For parsing a file and outputing another one post-treated, I would use awk.

Awk has associatives array (a.k.a. map in most languages) and it may help you doing the trick.

In a nutshell, your awk script should :

in your begin block : initialise your associative array, handle some arguments to set the output in the end block.
in your "line block" : insert current line in the associative array with the email as key
in your end block : generate the new file looping over the associative array.

Have fun =)

Upvotes: 0

How to group by email in CSV without knowing the email addresses?

Answers (3)

Related Questions