web
web

Reputation: 147

Sum the frequency for items that present in alternate columns

There are certain part of my data that I would like to further analyse on. The extracted information contained comparison between two items in separate columns and the frequency for each comparison. The order between the two columns in the table were not important.

Extracted info file (input file):

Name1 Name2 Freq
Micractinium Parachlorella 588
Parachlorella Planktochlorella 32
Planktochlorella Parachlorella 78

Expected output: Frequency of each comparison with no duplicate

Name1 Name2 Freq
Micractinium Parachlorella 588
Parachlorella Planktochlorella 110

I wanted to check the frequency of each comparison but problem arose when the comparison was duplicated as in the 2nd and 3rd row in the input file, where the comparisons involved the same name, but were in alternate columns.

I was processing the file using shellscript in Linux environment and in Excel initially, and trying to sort each row according to alphabetical order but it doesn't work, not sure what is the problem though.

Referred from here and here:

cat file | while read line; do echo $line | sed 's/\t/\n/g' | sort | gawk '{line=line " " $0} END {print line}' ; done

,

while read line; do 
      sorted=$(sort -g -- <<< "${line//[, ]/$'\n'}")
      printf -- "${sorted//$'\n'/,}\n"
    done < file

Then I tried to compile it manually but it is too time consuming to do it one by one. The list were up to 500 rows.

Wondering if this kind of task can better executed using R? I can use a bit of R, though not very familiar. Is there any function in R, that can save some time in this process? Or any other idea/suggestion other than R?

Upvotes: 1

Views: 73

Answers (5)

akrun
akrun

Reputation: 887223

Using data.table

library(data.table)
setDT(df)[, lapply(.SD, sum),.(Name_1 = pmin(Name1, Name2),
       Name_2 = pmax(Name1, Name2)), .SDcols = patterns("Freq")]

-oputut

    Name_1           Name_2 Freq
1:  Micractinium    Parachlorella  588
2: Parachlorella Planktochlorella  110

data

df <- structure(list(Name1 = c("Micractinium", "Parachlorella", "Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA, 
-3L))

Upvotes: 0

BarathVutukuri
BarathVutukuri

Reputation: 1303

With awk you can do this in a single line.

awk '$1<$2{a[$1FS$2]+=$3;next;}{a[$2FS$1]+=$3;} END{for (key in a){print key,a[key]}}' input.txt

More Readable format below

awk '$1 < $2 {
    a[$1 FS $2] += $3
    next
}

{
    a[$2 FS $1] += $3
}

END {
    for (key in a) {
        print key, a[key]
    }
}' input.txt


Upvotes: 2

ThomasIsCoding
ThomasIsCoding

Reputation: 101753

An igraph option

library(igraph)
setNames(
    get.data.frame(
        simplify(
            graph_from_data_frame(df, directed = FALSE),
            edge.attr.comb = "sum"
        )
    ),
    names(df)
)

gives

          Name1            Name2 Freq
1  Micractinium    Parachlorella  588
2 Parachlorella Planktochlorella  110

Upvotes: 3

AnilGoyal
AnilGoyal

Reputation: 26218

Even I would have also done it using pmin and pmax

df <- structure(list(Name1 = c("Micractinium", "Parachlorella","Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA, -3L))



library(dplyr, warn.conflicts = F)
df %>% group_by(Name_1 = pmin(Name1, Name2), Name_2 = pmax(Name1, Name2)) %>%
  summarise(across('Freq', sum), .groups = 'drop')
#> # A tibble: 2 x 3
#>   Name_1        Name_2            Freq
#>   <chr>         <chr>            <int>
#> 1 Micractinium  Parachlorella      588
#> 2 Parachlorella Planktochlorella   110

Created on 2021-06-21 by the reprex package (v2.0.0)

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 389047

In R, you can sort the names in Name1 and Name2 and sum the Freq value.

aggregate(Freq~Name1 + Name2, 
        transform(df, Name1 = pmin(Name1, Name2),  Name2 = pmax(Name1, Name2)), 
        sum, na.rm = TRUE)

#          Name1            Name2 Freq
#1  Micractinium    Parachlorella  588
#2 Parachlorella Planktochlorella  110

data

df <- structure(list(Name1 = c("Micractinium", "Parachlorella","Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 2

Related Questions