Reputation: 135

Analysing several columns of a dataset at the same time

I work with a real large dataset, where it is very difficult to look at all columns individually. At this time I only want to count the frequency of the information provided. Lets say I have a dataframe which looks like this:

|ID |Symptome1|Symptome2|Symptome3 |Symptome4|
|123|Rash     |Vomiting |Cough     |Pain|
|345|Urticaria|Rash     |Angioedema|Pain|

The symptoms occur at different time intervals, so I want to calculate the number of symptoms at each time interval. Symptome1 is time interval 1, Symptome2 at time interval 2, etc. Now I want to count all Symptome1, all Samyptome2, all Symptome3 etc. separately.

In a small dataset I would probably use

dataset%>%count(Symptome1)

Of course I could do it for all columns separately, but I guess I need something like a loop?

Upvotes: 3

Answers (4)

ThomasIsCoding

Reputation: 102529

Here is another way of presenting the results

table(
  reshape(
    dat,
    idvar = "ID",
    varying = -1,
    direction = "long",
    v.names = "Symptome",
    timevar = "SymptomeID"
  )
)

which gives a high-dimensional array without missing any information

, , Symptome = Angioedema

     SymptomeID
ID    1 2 3 4
  123 0 0 0 0
  345 0 0 1 0

, , Symptome = Cough

     SymptomeID
ID    1 2 3 4
  123 0 0 1 0
  345 0 0 0 0

, , Symptome = Pain

     SymptomeID
ID    1 2 3 4
  123 0 0 0 1
  345 0 0 0 1

, , Symptome = Rash

     SymptomeID
ID    1 2 3 4
  123 1 0 0 0
  345 0 1 0 0

, , Symptome = Urticaria

     SymptomeID
ID    1 2 3 4
  123 0 0 0 0
  345 1 0 0 0

, , Symptome = Vomiting

     SymptomeID
ID    1 2 3 4
  123 0 1 0 0
  345 0 0 0 0

On top of the above high-dimensional format, you can slice the array from an interest aspect, e.g., Symptome

table(
  reshape(
    dat,
    idvar = "ID",
    varying = -1,
    direction = "long",
    v.names = "Symptome",
    timevar = "SymptomeID"
  )
) %>% asplit("Symptome")

and then you will obtain a list

$Angioedema
     SymptomeID
ID    1 2 3 4
  123 0 0 0 0
  345 0 0 1 0

$Cough
     SymptomeID
ID    1 2 3 4
  123 0 0 1 0
  345 0 0 0 0

$Pain
     SymptomeID
ID    1 2 3 4
  123 0 0 0 1
  345 0 0 0 1

$Rash
     SymptomeID
ID    1 2 3 4
  123 1 0 0 0
  345 0 1 0 0

$Urticaria
     SymptomeID
ID    1 2 3 4
  123 0 0 0 0
  345 1 0 0 0

$Vomiting
     SymptomeID
ID    1 2 3 4
  123 0 1 0 0
  345 0 0 0 0

Upvotes: 1

Friede

Reputation: 7979

We have stack(). With table():

> table(stack(dat[-1]))
            ind
values       Symptome1 Symptome2 Symptome3 Symptome4
  Angioedema         0         0         1         0
  Cough              0         0         1         0
  Pain               0         0         0         2
  Rash               1         1         0         0
  Urticaria          1         0         0         0
  Vomiting           0         1         0         0

Or aggregate():

> X = stack(dat[-1])
> X$n = seq(nrow(X))
> aggregate(n ~ ., X, FUN = length)
      values       ind n
1       Rash Symptome1 1
2  Urticaria Symptome1 1
3       Rash Symptome2 1
4   Vomiting Symptome2 1
5 Angioedema Symptome3 1
6      Cough Symptome3 1
7       Pain Symptome4 2

dat is stolen from SamR's answer.

Upvotes: 2

jpsmith

Reputation: 17656

In base R you can use lapply and make a table out of each column. This will store each column in a list:

lapply(dat[-1], table)

# Note if you have additional columns and only want the ones that have "Symptome" (sic), you can use:
lapply(dat[grep("Symptome", names(dat))], table)

If you want them in a flat format, you can use sapply and use labels = unique(unlist(dat[-1])) to ensure you've covered all the possible symptoms in the overall dataset:

sapply(dat[-1], function(x) 
  table(factor(x, levels = unique(unlist(dat[-1]))))
)

Output:

#            Symptome1 Symptome2 Symptome3 Symptome4
# Rash               1         1         0         0
# Urticaria          1         0         0         0
# Vomiting           0         1         0         0
# Cough              0         0         1         0
# Angioedema         0         0         1         0
# Pain               0         0         0         2

Data:

# using data from SamR's answer (thanks @SamR!)
dat <- structure(list(ID = c(123L, 345L), Symptome1 = c("Rash", "Urticaria"
), Symptome2 = c("Vomiting", "Rash"), Symptome3 = c("Cough", 
                                                    "Angioedema"), Symptome4 = c("Pain", "Pain")), class = "data.frame", row.names = c(NA, -2L))

Per your comment, if you want to make a barplot:

xx <- sapply(dat[-1], function(x) 
  table(factor(x, levels = unique(unlist(dat[-1]))))
)

barplot(xx, beside = TRUE,
        legend.text = TRUE,
        args.legend = list(x = "topleft", bty = "n"))

Upvotes: 2

SamR

Reputation: 20494

I think easier to do this by putting the data in long form first using tidyr::pivot_longer().

dat |>
    tidyr::pivot_longer(-ID, names_to = "Symptome") |>
    count(Symptome, value)

# # A tibble: 7 × 3
#   Symptome  value          n
#   <chr>     <chr>      <int>
# 1 Symptome1 Rash           1
# 2 Symptome1 Urticaria      1
# 3 Symptome2 Rash           1
# 4 Symptome2 Vomiting       1
# 5 Symptome3 Angioedema     1
# 6 Symptome3 Cough          1
# 7 Symptome4 Pain           2

Data

dat <- structure(list(ID = c(123L, 345L), Symptome1 = c("Rash", "Urticaria"
), Symptome2 = c("Vomiting", "Rash"), Symptome3 = c("Cough", 
"Angioedema"), Symptome4 = c("Pain", "Pain")), class = "data.frame", row.names = c(NA, -2L))

Upvotes: 2

Analysing several columns of a dataset at the same time

Answers (4)

Data

Related Questions