Japes
Japes

Reputation: 209

R - Count observations (strings) in a column and calculate as a percentage of overall observations

Hi Have some data that looks like this:

ID   Item     Class  Value  Date
1    Eggs     A      5      07/07/21
2    Eggs     A      4.5    07/07/21
3    Cereal   C      2      07/07/21
4    Eggs     B      3.5    07/07/21
5    Bread    B      2.5    07/07/21
6    Juice    A      3      07/07/21
7    Juice    C      1.5    07/07/21
8    Eggs     C      2      07/07/21
9    Bread    A      3      07/07/21
10   Juice    A      3      07/07/21

This is just a sample, the actual data has around 8k rows (and there are 300+ dataframes). I want to create a new df that

  1. calculates the percentage of observations in the Item column (e.g. there are 10 items in the column, four of which are Eggs, thus the percentage of Eggs is 40)
  2. calculates the average Value of each Item group

Ideally, the final data would look something like this:

Item    Percentage  Average_Value  Date
Eggs    40          3.75           07/07/21
Cereal  10          2              07/07/21
Bread   20          2.75           07/07/21
Juice   30          2.5            07/07/21

A couple of things to note - I want to do this to 300+ different csv's (where my data is stored) and create a single df with this new information. Each csv refers to a different date, so the final df would look something like the above, only with another four lines, but with different values and a different date. I know that I'll need to use a for loop for that, and maybe I should figure that part out later, but thought it was worth mentioning now. Finally, at some point, I might want to also calculate the average value of Class. Would I be better of making a separate df for this as I can't see how else I could do this?

Upvotes: 0

Views: 927

Answers (4)

AnilGoyal
AnilGoyal

Reputation: 26218

Do these steps in tidyverse

  • set working directory to directory where your 300+ csv files are stored
  • read all 300+ csv names into temp
  • assuming each csv name refers to your date else you have to tweak the code a little bit
  • using map and imap_dfr as explained below you can execute the same code for each file only once and as a result you'll have a single dataframe
setwd('my/path/here')

temp <- list.files(pattern = '*.csv')

library(tidyverse)

map(temp, read.csv) %>% setNames(gsub('.csv', '', temp)) %>%
  imap_dfr(~ .x %>% group_by(item) %>%
         summarise(Percentage = n()/nrow(df)*100, 
                   Average_Value = mean(Value), .groups = 'drop') %>%
         mutate(Date = .y))

If instead all your csvs contain date column do this

map_dfr(temp, ~read.csv(.x) %>% group_by(item, date) %>%
          summarise(Percentage = n()/nrow(df)*100, 
                    Average_Value = mean(Value), .groups = 'drop'))

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388972

List all the files that you want to use using list.files. Combine them in one file using map_df and for each Date calculate the percentage of Item present and it's average value.

library(tidyverse)

filenames <- list.files(pattern = '\\.csv$')

map_df(filenames, read_csv) %>%
  group_by(Date, Item) %>%
  summarise(Percentage = n(), 
            Average_Value = mean(Value)) %>%
  mutate(Percentage = prop.table(Percentage) * 100) %>%
  ungroup -> result

result

Upvotes: 0

GKi
GKi

Reputation: 39657

You can merge the result of proportions of table with the result form aggregate of the meanof Value ~ Item. In case the Date also need to be there, this could be added using cbind or data.frame.

merge(aggregate(cbind(Average_Value = Value) ~ Item, x, mean)
    , proportions(table(x$Item))*100
    , by.y=1, by.x="Item")[c(1,3,2)]
#    Item Freq Average_Value
#1  Bread   20          2.75
#2 Cereal   10          2.00
#3   Eggs   40          3.75
#4  Juice   30          2.50

Or using only aggregate:

aggregate(Value ~ Item, x, function(y) c(Freq=length(y)/nrow(x)*100, Average=mean(y)))
#    Item Value.Freq Value.Average
#1  Bread      20.00          2.75
#2 Cereal      10.00          2.00
#3   Eggs      40.00          3.75
#4  Juice      30.00          2.50

Upvotes: 1

Karthik S
Karthik S

Reputation: 11584

Does this work:

library(dplyr)
df %>% group_by(Item) %>% summarise(Percentage = n()/nrow(df)*100, Average_Value = mean(Value))
# A tibble: 4 x 3
  Item   Percentage Average_Value
  <chr>       <dbl>         <dbl>
1 Bread          20          2.75
2 Cereal         10          2   
3 Eggs           40          3.75
4 Juice          30          2.5 

Upvotes: 1

Related Questions