Reputation: 147
There is a certain part of my data that I would like to analyse further. The extracted information contains comparisons between two items in separate columns and the frequency of each comparison. The order of the two name columns in the table is not important.
Extracted info file (input file):
Name1 | Name2 | Freq |
---|---|---|
Micractinium | Parachlorella | 588 |
Parachlorella | Planktochlorella | 32 |
Planktochlorella | Parachlorella | 78 |
Expected output: frequency of each comparison, with no duplicates
Name1 | Name2 | Freq |
---|---|---|
Micractinium | Parachlorella | 588 |
Parachlorella | Planktochlorella | 110 |
I wanted to check the frequency of each comparison, but a problem arose when a comparison was duplicated, as in the 2nd and 3rd rows of the input file, where the comparison involved the same names but in swapped columns.
I was initially processing the file with a shell script in a Linux environment and in Excel, trying to sort each row into alphabetical order, but it doesn't work and I am not sure what the problem is.
cat file | while read line; do echo $line | sed 's/\t/\n/g' | sort | gawk '{line=line " " $0} END {print line}' ; done
or
while read line; do
sorted=$(sort -g -- <<< "${line//[, ]/$'\n'}")
printf -- "${sorted//$'\n'/,}\n"
done < file
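For reference, one way the row-wise alphabetical ordering could be done (a sketch, not necessarily the intended fix; it assumes a tab-separated file with no header row) is to let awk swap the two name fields when they are out of order:

```shell
# Put the two names of each tab-separated row in alphabetical order,
# keeping the frequency in the third column untouched.
awk 'BEGIN { FS = OFS = "\t" }
$1 > $2 { tmp = $1; $1 = $2; $2 = tmp }
{ print }' input.txt
```

After this step, duplicated comparisons share identical Name1/Name2 values, so their frequencies can be summed in a second pass.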
Then I tried to combine them manually, but it is too time-consuming to do it one by one; the list runs to about 500 rows.
I am wondering whether this kind of task can be better executed using R? I can use a bit of R, though I am not very familiar with it. Is there any function in R that can save some time in this process? Or any other idea/suggestion besides R?
Upvotes: 1
Views: 73
Reputation: 887223
Using data.table
library(data.table)
setDT(df)[, lapply(.SD, sum),.(Name_1 = pmin(Name1, Name2),
Name_2 = pmax(Name1, Name2)), .SDcols = patterns("Freq")]
-output
Name_1 Name_2 Freq
1: Micractinium Parachlorella 588
2: Parachlorella Planktochlorella 110
df <- structure(list(Name1 = c("Micractinium", "Parachlorella", "Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA,
-3L))
Upvotes: 0
Reputation: 1303
With awk you can do this in a single line.
awk '$1<$2{a[$1FS$2]+=$3;next;}{a[$2FS$1]+=$3;} END{for (key in a){print key,a[key]}}' input.txt
More readable format below:
awk '$1 < $2 {
a[$1 FS $2] += $3
next
}
{
a[$2 FS $1] += $3
}
END {
for (key in a) {
print key, a[key]
}
}' input.txt
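For example, run against the sample data from the question (a sketch; it assumes space-separated fields and that input.txt has no header row), the one-liner merges the two Parachlorella/Planktochlorella rows:

```shell
# Create the sample input, then aggregate: each unordered pair of
# names is keyed in alphabetical order, so swapped-column duplicates
# land in the same array slot and their frequencies are summed.
printf 'Micractinium Parachlorella 588\nParachlorella Planktochlorella 32\nPlanktochlorella Parachlorella 78\n' > input.txt
awk '$1<$2{a[$1FS$2]+=$3;next}{a[$2FS$1]+=$3} END{for (key in a) print key, a[key]}' input.txt
```

Note that `for (key in a)` iterates in an unspecified order, so pipe the result through `sort` if a stable row order matters.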
Upvotes: 2
Reputation: 101753
An igraph option
library(igraph)
setNames(
get.data.frame(
simplify(
graph_from_data_frame(df, directed = FALSE),
edge.attr.comb = "sum"
)
),
names(df)
)
gives
Name1 Name2 Freq
1 Micractinium Parachlorella 588
2 Parachlorella Planktochlorella 110
Upvotes: 3
Reputation: 26218
I would also have done it using pmin and pmax.
df <- structure(list(Name1 = c("Micractinium", "Parachlorella","Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA, -3L))
library(dplyr, warn.conflicts = F)
df %>% group_by(Name_1 = pmin(Name1, Name2), Name_2 = pmax(Name1, Name2)) %>%
summarise(across('Freq', sum), .groups = 'drop')
#> # A tibble: 2 x 3
#> Name_1 Name_2 Freq
#> <chr> <chr> <int>
#> 1 Micractinium Parachlorella 588
#> 2 Parachlorella Planktochlorella 110
Created on 2021-06-21 by the reprex package (v2.0.0)
Upvotes: 2
Reputation: 389047
In R, you can sort the names in Name1 and Name2 and sum the Freq value.
aggregate(Freq~Name1 + Name2,
transform(df, Name1 = pmin(Name1, Name2), Name2 = pmax(Name1, Name2)),
sum, na.rm = TRUE)
# Name1 Name2 Freq
#1 Micractinium Parachlorella 588
#2 Parachlorella Planktochlorella 110
data
df <- structure(list(Name1 = c("Micractinium", "Parachlorella","Planktochlorella"
), Name2 = c("Parachlorella", "Planktochlorella", "Parachlorella"
), Freq = c(588L, 32L, 78L)), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 2