Reputation: 135
I work with a real large dataset, where it is very difficult to look at all columns individually. At this time I only want to count the frequency of the information provided. Lets say I have a dataframe which looks like this:
|ID |Symptome1|Symptome2|Symptome3 |Symptome4|
|123|Rash |Vomiting |Cough |Pain|
|345|Urticaria|Rash |Angioedema|Pain|
The symptoms occur at different time intervals, so I want to calculate the number of symptoms at each time interval. Symptome1 is time interval 1, Symptome2 at time interval 2, etc. Now I want to count all Symptome1, all Samyptome2, all Symptome3 etc. separately.
In a small dataset I would probably use
dataset%>%count(Symptome1)
Of course I could do it for all columns separately, but I guess I need something like a loop?
Upvotes: 3
Views: 92
Reputation: 102529
Here is another way of presenting the results
table(
reshape(
dat,
idvar = "ID",
varying = -1,
direction = "long",
v.names = "Symptome",
timevar = "SymptomeID"
)
)
which gives a high-dimensional array without missing any information
, , Symptome = Angioedema
SymptomeID
ID 1 2 3 4
123 0 0 0 0
345 0 0 1 0
, , Symptome = Cough
SymptomeID
ID 1 2 3 4
123 0 0 1 0
345 0 0 0 0
, , Symptome = Pain
SymptomeID
ID 1 2 3 4
123 0 0 0 1
345 0 0 0 1
, , Symptome = Rash
SymptomeID
ID 1 2 3 4
123 1 0 0 0
345 0 1 0 0
, , Symptome = Urticaria
SymptomeID
ID 1 2 3 4
123 0 0 0 0
345 1 0 0 0
, , Symptome = Vomiting
SymptomeID
ID 1 2 3 4
123 0 1 0 0
345 0 0 0 0
On top of the above high-dimensional format, you can slice the array from an interest aspect, e.g., Symptome
table(
reshape(
dat,
idvar = "ID",
varying = -1,
direction = "long",
v.names = "Symptome",
timevar = "SymptomeID"
)
) %>% asplit("Symptome")
and then you will obtain a list
$Angioedema
SymptomeID
ID 1 2 3 4
123 0 0 0 0
345 0 0 1 0
$Cough
SymptomeID
ID 1 2 3 4
123 0 0 1 0
345 0 0 0 0
$Pain
SymptomeID
ID 1 2 3 4
123 0 0 0 1
345 0 0 0 1
$Rash
SymptomeID
ID 1 2 3 4
123 1 0 0 0
345 0 1 0 0
$Urticaria
SymptomeID
ID 1 2 3 4
123 0 0 0 0
345 1 0 0 0
$Vomiting
SymptomeID
ID 1 2 3 4
123 0 1 0 0
345 0 0 0 0
Upvotes: 1
Reputation: 7979
We have stack()
. With table()
:
> table(stack(dat[-1]))
ind
values Symptome1 Symptome2 Symptome3 Symptome4
Angioedema 0 0 1 0
Cough 0 0 1 0
Pain 0 0 0 2
Rash 1 1 0 0
Urticaria 1 0 0 0
Vomiting 0 1 0 0
Or aggregate()
:
> X = stack(dat[-1])
> X$n = seq(nrow(X))
> aggregate(n ~ ., X, FUN = length)
values ind n
1 Rash Symptome1 1
2 Urticaria Symptome1 1
3 Rash Symptome2 1
4 Vomiting Symptome2 1
5 Angioedema Symptome3 1
6 Cough Symptome3 1
7 Pain Symptome4 2
dat
is stolen from SamR's answer.
Upvotes: 2
Reputation: 17656
In base R you can use lapply
and make a table out of each column. This will store each column in a list:
lapply(dat[-1], table)
# Note if you have additional columns and only want the ones that have "Symptome" (sic), you can use:
lapply(dat[grep("Symptome", names(dat))], table)
If you want them in a flat format, you can use sapply
and use labels = unique(unlist(dat[-1]))
to ensure you've covered all the possible symptoms in the overall dataset:
sapply(dat[-1], function(x)
table(factor(x, levels = unique(unlist(dat[-1]))))
)
Output:
# Symptome1 Symptome2 Symptome3 Symptome4
# Rash 1 1 0 0
# Urticaria 1 0 0 0
# Vomiting 0 1 0 0
# Cough 0 0 1 0
# Angioedema 0 0 1 0
# Pain 0 0 0 2
Data:
# using data from SamR's answer (thanks @SamR!)
dat <- structure(list(ID = c(123L, 345L), Symptome1 = c("Rash", "Urticaria"
), Symptome2 = c("Vomiting", "Rash"), Symptome3 = c("Cough",
"Angioedema"), Symptome4 = c("Pain", "Pain")), class = "data.frame", row.names = c(NA, -2L))
Per your comment, if you want to make a barplot:
xx <- sapply(dat[-1], function(x)
table(factor(x, levels = unique(unlist(dat[-1]))))
)
barplot(xx, beside = TRUE,
legend.text = TRUE,
args.legend = list(x = "topleft", bty = "n"))
Upvotes: 2
Reputation: 20494
I think easier to do this by putting the data in long form first using tidyr::pivot_longer()
.
dat |>
tidyr::pivot_longer(-ID, names_to = "Symptome") |>
count(Symptome, value)
# # A tibble: 7 × 3
# Symptome value n
# <chr> <chr> <int>
# 1 Symptome1 Rash 1
# 2 Symptome1 Urticaria 1
# 3 Symptome2 Rash 1
# 4 Symptome2 Vomiting 1
# 5 Symptome3 Angioedema 1
# 6 Symptome3 Cough 1
# 7 Symptome4 Pain 2
dat <- structure(list(ID = c(123L, 345L), Symptome1 = c("Rash", "Urticaria"
), Symptome2 = c("Vomiting", "Rash"), Symptome3 = c("Cough",
"Angioedema"), Symptome4 = c("Pain", "Pain")), class = "data.frame", row.names = c(NA, -2L))
Upvotes: 2