Natasa

Reputation: 11

Partition several vectors into groups (knowing the range of the values of each group) for barplots

My name is Natasa and I’m new to R. I’m impressed by what R can do, but unfortunately I don’t have the time to learn it from the beginning.

I have a lot of vectors (11) with 10000 values each, so I will use a more “compact” version here. Let’s say that I have 4 vectors, where TI = time and RE = region (1, 2 or 3):

TI <- c(10, 20, 30, 40, 50, 100, 150, 200, 300)
RE1 <- c(0.25, 0.78, 0.35, 0.37, 4.56, 5.23, 3.75, 8.51, 10.85)
RE2 <- c(0.05, 1.54, 0.4, 0.42, 2.53, 1.38, 4.58, 10.54, 25.35)
RE3 <- c(0.02, 0.53, 0.72, 0.28, 7.82, 13.51, 23.54, 2.15)

I want to create groups of “TI” (time series: group1 = the TI values 10, 20, 30 and 40; group2 = values between 50 and 150; group3 = 200 and 300) and compute the mean and standard deviation of each RE vector for each group of TI. The groups are of unequal length, and I don’t know the number of values in each group (only the “range”). My final goal is to create a grouped bar plot: the x axis will show the groups of TI (the time series), the y axis the values of the regions, and within each time series there will be a separate bar for each region.

I have found on the internet several pages and I have tried several things, but without any success. My thoughts were:

  1. To create a “table” (using the cbind function) like this: All <- cbind(TI, RE1, RE2, RE3)
  2. Partition the TI vector into groups, and the other vectors according to the TI grouping. The pages I found suggest either using the split function (as in “How to partition a vector into groups of neighbors in R?” and “Split a vector into three vectors of unequal length in R”) or renaming the different values of TI according to the groups (group1, group2 and group3) using the replace function (as in “Replace given value in vector”).
  3. Use the aggregate function, as in “Mean per group in a data.frame” or “R: how can I create a table with mean and sd according to experimental group alongside p-values?”
  4. And finally use the barplot function.

The only problem is that I can’t find the correct way to split the table into the desired groups, or an “easy” way to rename specific values of TI (thought 2). The wanted table (if my "thoughts" are correct):

TI        RE1    RE2    RE3
group1   0.25   0.05   0.02
group1   0.78   1.54   0.53
group1   0.35   0.40   0.72
group1   0.37   0.42   0.28
group2   4.56   2.53   7.82
group2   5.23   1.38  13.51
group2   3.75   4.58  23.54
group3   8.51  10.54   2.15
group3  10.85  25.35   0.65

Since my data is large, I don’t think that using the replace function for each value is “affordable”. My other thought was to compute the mean and SD separately for each group of TI and each RE, then insert a column with the desired group names and combine all the “tables” into one, but that would be very time consuming and not practical. Is there a way to tell R to rename all the numbers between 10 and 40 in the vector TI to group1, the values between 50 and 150 to group2, etc., or to treat the numbers in a given range as a group? If not, is there an easier way to compute the mean and sd of one vector over a specific range of values of a different vector? Or is none of this needed, and I can do it directly with the barplot function (I also tried that, without any success)?

It is really hard for me to figure it out with such limited experience, and any help will be greatly appreciated!! Thanks in advance for your responses.

Upvotes: 1

Views: 1052

Answers (2)

Alexey Shiklomanov

Reputation: 1652

For picking out values in a group, the %in% construct is handy, although Froom2's suggestion with < and > is more robust, because %in% only matches values that are exactly in the set (here, whole numbers).

a <- c(10, 13, 18, 21, 15, 32)
a %in% 10:20
# [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
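
As a possible alternative to both approaches, base R's cut() maps each value to a labelled interval in one step. This is only a sketch: the break points 40 and 150 are my assumption, read off the ranges described in the question, and the label names are arbitrary.

```r
# Time values from the question
TI <- c(10, 20, 30, 40, 50, 100, 150, 200, 300)

# Intervals are (0, 40], (40, 150], (150, Inf); cut() is right-closed by default.
# The break points are assumed from the ranges given in the question.
gp <- cut(TI, breaks = c(0, 40, 150, Inf),
          labels = c("group1", "group2", "group3"))
table(gp)

# %in% only matches exact members of the set, so a non-integer value is missed,
# while cut() still places it in the first interval
frac_in  <- 10.5 %in% 10:20
frac_cut <- cut(10.5, breaks = c(0, 40, 150, Inf),
                labels = c("group1", "group2", "group3"))
```

The result gp is a factor, so it can be used directly as a grouping column in aggregate(), data.table or dplyr.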

For summarizing and generally working with data, I would check out the data.table package.

library(data.table)
data <- data.table(TI = c(10, 20, 30, 40, 50, 100, 150, 200, 300),
                   RE1 = c(0.25, 0.78, 0.35, 0.37, 4.56, 5.23, 3.75, 8.51, 10.85),
                   RE2 = c(0.05, 1.54, 0.4, 0.42, 2.53, 1.38, 4.58, 10.54, 25.35),
                   RE3 = c(0.02, 0.53, 0.72, 0.28, 7.82, 13.51, 23.54, 2.15, NA))
g1 <- 1:40
g2 <- 41:150
data[TI %in% g1, gp := "group1"]
data[TI %in% g2, gp := "group2"]
data[TI > 150, gp := "group3"]
data
#     TI   RE1   RE2   RE3     gp
# 1:  10  0.25  0.05  0.02 group1
# 2:  20  0.78  1.54  0.53 group1
# 3:  30  0.35  0.40  0.72 group1
# 4:  40  0.37  0.42  0.28 group1
# 5:  50  4.56  2.53  7.82 group2
# 6: 100  5.23  1.38 13.51 group2
# 7: 150  3.75  4.58 23.54 group2
# 8: 200  8.51 10.54  2.15 group3
# 9: 300 10.85 25.35    NA group3

The := operator assigns by reference (in place), which can be used to overwrite an existing column or create a new one. It is basically the same thing as data$gp <- .... Also, as you may have noticed, a nice feature of data.tables is that they implicitly use with()-style syntax; i.e. it knows you're talking about its columns, so you don't have to specify data$... every time.

Then, summarizing is really easy.

data[, lapply(.SD, mean, na.rm=TRUE), by = gp, .SDcols=c("RE1", "RE2", "RE3")]
#        gp      RE1     RE2      RE3
# 1: group1 0.437500  0.6025  0.38750
# 2: group2 4.513333  2.8300 14.95667
# 3: group3 9.680000 17.9450  2.15000

This syntax is a little strange, but here's the gist: lapply(l, FUN, ...) takes a list or vector (l) and applies the function (FUN) to every element of l, with ... as additional arguments to FUN. Here, .SD refers to the subset of the data.table currently being processed (the rows of data for one group), so in words, that whole block is saying "apply function mean with argument na.rm=TRUE to every column of the data.table I'm working on". by lets you group the rows (in this case, by column gp), so the expression is evaluated once per group. Finally, .SDcols indicates by name which columns to include in .SD. Omitting it causes .SD to contain ALL columns other than the grouping column, which here would also include TI (and the mean of column TI is, I think, meaningless for your purposes).
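
To make the idea behind .SD and by more concrete, here is a rough base R analogue (not data.table itself): split() creates one small data frame per group, and the function is then applied to every column of each piece. The variable names (df, blocks, group_means, group_sds) are just for illustration, and the same pattern gives the standard deviations the question also asked for.

```r
# Same values as above, with the group labels already assigned
df <- data.frame(
  gp  = c(rep("group1", 4), rep("group2", 3), rep("group3", 2)),
  RE1 = c(0.25, 0.78, 0.35, 0.37, 4.56, 5.23, 3.75, 8.51, 10.85),
  RE2 = c(0.05, 1.54, 0.4, 0.42, 2.53, 1.38, 4.58, 10.54, 25.35),
  RE3 = c(0.02, 0.53, 0.72, 0.28, 7.82, 13.51, 23.54, 2.15, NA)
)
cols <- c("RE1", "RE2", "RE3")   # plays the role of .SDcols

# One data frame per group; each piece is roughly what .SD is under `by = gp`
blocks <- split(df[cols], df$gp)

# Apply mean (or sd) to every column of every piece; rows = groups after t()
group_means <- t(sapply(blocks, function(block) sapply(block, mean, na.rm = TRUE)))
group_sds   <- t(sapply(blocks, function(block) sapply(block, sd,   na.rm = TRUE)))

group_means
group_sds
```

Note that the sd of RE3 in group3 comes out as NA, because only one non-missing value is left in that group.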

Upvotes: 0

Froom2

Reputation: 1279

If you want your groups to be unevenly split (as in your example) then the following may be helpful, although there is likely to be a slicker way of doing it...

I have used the package dplyr to get the summaries by group, which you would need to install if you haven't already got it.

data <- data.frame(TI = c(10, 20, 30, 40, 50, 100, 150, 200, 300),
                   RE1 = c(0.25, 0.78, 0.35, 0.37, 4.56, 5.23, 3.75, 8.51, 10.85),
                   RE2 = c(0.05, 1.54, 0.4, 0.42, 2.53, 1.38, 4.58, 10.54, 25.35),
                   RE3 = c(0.02, 0.53, 0.72, 0.28, 7.82, 13.51, 23.54, 2.15, NA))

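# Assign each row to a group by TI range; the bounds meet exactly,
# so no TI value (e.g. 41 or 151) can fall between two groups
data$gp <- NA

data$gp[data$TI > 0 & data$TI <= 40] <- "g1"
data$gp[data$TI > 40 & data$TI <= 150] <- "g2"
data$gp[data$TI > 150] <- "g3"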

library(dplyr)

data <- group_by(data, gp)

summarise(data, mean(RE1, na.rm = TRUE), mean(RE2, na.rm = TRUE), mean(RE3, na.rm = TRUE))

summarise(data, sd(RE1, na.rm = TRUE), sd(RE2, na.rm = TRUE), sd(RE3, na.rm = TRUE))
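
For the grouped bar plot that was the final goal, one possible route in base R is aggregate() followed by barplot(..., beside = TRUE). This is only a sketch under my own assumptions: the group bounds (40 and 150) are read off the question, and the object names (means, heights, bar_mids) and axis labels are made up for illustration.

```r
data <- data.frame(TI = c(10, 20, 30, 40, 50, 100, 150, 200, 300),
                   RE1 = c(0.25, 0.78, 0.35, 0.37, 4.56, 5.23, 3.75, 8.51, 10.85),
                   RE2 = c(0.05, 1.54, 0.4, 0.42, 2.53, 1.38, 4.58, 10.54, 25.35),
                   RE3 = c(0.02, 0.53, 0.72, 0.28, 7.82, 13.51, 23.54, 2.15, NA))

# Group labels by TI range (bounds assumed from the question)
data$gp <- NA
data$gp[data$TI <= 40] <- "g1"
data$gp[data$TI > 40 & data$TI <= 150] <- "g2"
data$gp[data$TI > 150] <- "g3"

# Mean of every RE column per group; na.rm = TRUE is passed on to mean()
means <- aggregate(data[c("RE1", "RE2", "RE3")],
                   by = list(gp = data$gp), FUN = mean, na.rm = TRUE)

# barplot() wants a matrix: one column per cluster (TI group), one row per bar (region)
heights <- t(as.matrix(means[, -1]))
colnames(heights) <- means$gp

bar_mids <- barplot(heights, beside = TRUE,
                    legend.text = rownames(heights),
                    args.legend = list(x = "topleft"),
                    xlab = "TI group", ylab = "Mean value")
```

barplot() returns the x positions of the bars (bar_mids here), which could be used with arrows() to draw standard deviation error bars on top.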

Upvotes: 0
