Reputation: 147
There is a certain part of my data that I would like to analyse further. The extracted information contains comparisons between two items in separate columns and the frequency of each comparison. The order of the two name columns in the table is not important.
Extracted info file (input file):
Name1 | Name2 | Freq |
---|---|---|
Micractinium | Parachlorella | 588 |
Parachlorella | Planktochlorella | 32 |
Planktochlorella | Parachlorella | 78 |
Expected output: frequency of each comparison, with no duplicates
Name1 | Name2 | Freq |
---|---|---|
Micractinium | Parachlorella | 588 |
Parachlorella | Planktochlorella | 110 |
I wanted to check the frequency of each comparison, but a problem arose when a comparison was duplicated, as in the 2nd and 3rd rows of the input file, where the comparison involved the same names but in swapped columns.
I was initially processing the file with a shell script in a Linux environment and in Excel, trying to sort each row into alphabetical order, but it doesn't work and I am not sure what the problem is.
cat file | while read line; do echo $line | sed 's/\t/\n/g' | sort | gawk '{line=line " " $0} END {print line}' ; done
or
while read line; do
sorted=$(sort -g -- <<< "${line//[, ]/$'\n'}")
printf -- "${sorted//$'\n'/,}\n"
done < file
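For reference, one way the row-wise alphabetical ordering could be done (a sketch, not necessarily the intended fix; it assumes a tab-separated file with no header row) is to let awk swap the two name fields when they are out of order:

```shell
# Put the two names of each tab-separated row in alphabetical order,
# keeping the frequency in the third column untouched.
awk 'BEGIN { FS = OFS = "\t" }
$1 > $2 { tmp = $1; $1 = $2; $2 = tmp }
{ print }' input.txt
```

After this step, duplicated comparisons share identical Name1/Name2 values, so their frequencies can be summed in a second pass.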
Then I tried to combine them manually, but it is too time-consuming to do it one by one; the list runs to about 500 rows.
I am wondering whether this kind of task can be better executed using R? I can use a bit of R, though I am not very familiar with it. Is there any function in R that can save some time in this process? Or any other idea/suggestion besides R?
Upvotes: 1
Views: 73
Reputation: 887223
Using data.table
library(data.table)
setDT(df)[, lapply(.SD, sum),.(Name_1 = pmin(Name1, Name2),
Name_2 = pmax(Name1, Name2)), .SDcols = patterns("Freq")]
-output
Name_1 Name_2 Freq
1: Micractinium Parachlorella 588
2: Parachlorella Planktochlorella 110
df <- structure(list(Name1 = c("Micractinium", "Parachlorella", "Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA,
-3L))
Upvotes: 0
Reputation: 1303
With awk you can do this in a single line.
awk '$1<$2{a[$1FS$2]+=$3;next;}{a[$2FS$1]+=$3;} END{for (key in a){print key,a[key]}}' input.txt
More readable format below:
awk '$1 < $2 {
a[$1 FS $2] += $3
next
}
{
a[$2 FS $1] += $3
}
END {
for (key in a) {
print key, a[key]
}
}' input.txt
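For example, run against the sample data from the question (a sketch; it assumes space-separated fields and that input.txt has no header row), the one-liner merges the two Parachlorella/Planktochlorella rows:

```shell
# Create the sample input, then aggregate: each unordered pair of
# names is keyed in alphabetical order, so swapped-column duplicates
# land in the same array slot and their frequencies are summed.
printf 'Micractinium Parachlorella 588\nParachlorella Planktochlorella 32\nPlanktochlorella Parachlorella 78\n' > input.txt
awk '$1<$2{a[$1FS$2]+=$3;next}{a[$2FS$1]+=$3} END{for (key in a) print key, a[key]}' input.txt
```

Note that `for (key in a)` iterates in an unspecified order, so pipe the result through `sort` if a stable row order matters.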
Upvotes: 2
Reputation: 101753
An igraph option
library(igraph)
setNames(
get.data.frame(
simplify(
graph_from_data_frame(df, directed = FALSE),
edge.attr.comb = "sum"
)
),
names(df)
)
gives
Name1 Name2 Freq
1 Micractinium Parachlorella 588
2 Parachlorella Planktochlorella 110
Upvotes: 3
Reputation: 26218
I would also have done it using pmin and pmax.
df <- structure(list(Name1 = c("Micractinium", "Parachlorella","Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA, -3L))
library(dplyr, warn.conflicts = F)
df %>% group_by(Name_1 = pmin(Name1, Name2), Name_2 = pmax(Name1, Name2)) %>%
summarise(across('Freq', sum), .groups = 'drop')
#> # A tibble: 2 x 3
#> Name_1 Name_2 Freq
#> <chr> <chr> <int>
#> 1 Micractinium Parachlorella 588
#> 2 Parachlorella Planktochlorella 110
Created on 2021-06-21 by the reprex package (v2.0.0)
Upvotes: 2
Reputation: 389047
In R, you can sort the names in Name1 and Name2 and sum the Freq value.
aggregate(Freq~Name1 + Name2,
transform(df, Name1 = pmin(Name1, Name2), Name2 = pmax(Name1, Name2)),
sum, na.rm = TRUE)
# Name1 Name2 Freq
#1 Micractinium Parachlorella 588
#2 Parachlorella Planktochlorella 110
data
df <- structure(list(Name1 = c("Micractinium", "Parachlorella","Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 2