bountan
bountan

Reputation: 89

R count by group / Loop function and output to csv

I have a dataframe containing user data :

age = c(45, 21, 32, 33, 46)
gender = c('female', 'female', 'male', 'male', 'female')
income = c('low', 'low', 'medium', 'high', 'low')
education = c('high', 'high', 'high', 'medium', 'medium')

df = data.frame(age, gender ,income, education)

From this i would like to obtain a legible list with a count & share of total for every attribute that i then would append to a table / csv that should be rather legible for further use than be a functioning dataframe. For one attribute that would be something like this:

nusers = nrow(users)
df = count(users, gender)
df['sot']=df['n']/totuser
write.table(df,'stat.csv',sep=';', row.names = FALSE, append = T)

With the following result desired for multiple attributes:

gender,n,sot
female,10,0.526315789
male,9,0.473684211
income,Freq,sot
low,4,0.210526316
medium,10,0.526315789
high,5,0.263157895
education,Freq,sot
low,8,0.421052632
medium,1,0.052631579
high,10,0.526315789

My (not very proficient) attempts to put this into a loop failed. How would i best go about this ?

Upvotes: 0

Views: 354

Answers (3)

Greg
Greg

Reputation: 3336

Here's a solution with the dplyr package.

The actual code could theoretically be confined to just a single line

library(dplyr)

# ...

for(nom in names(df)) write.table(df %>% count(!!sym(nom)) %>% mutate(sot = n/sum(n)), 'stat.csv', sep = ';', row.names = FALSE, append = TRUE)

to yield the output file stat.csv

"age";"n";"sot"
21;1;0.2
32;1;0.2
33;1;0.2
45;1;0.2
46;1;0.2
"gender";"n";"sot"
"female";3;0.6
"male";2;0.4
"income";"n";"sot"
"high";1;0.2
"low";3;0.6
"medium";1;0.2
"education";"n";"sot"
"high";3;0.6
"medium";2;0.4

but I've chosen to break up the workflow, with comments for clarity:

library(dplyr)


# ...
# Code to generate `df`
# ...


# Create list to accumulate the summaries
results <- list()

# For each variable (by name) in `df`...
for(nom in names(df)) {
  # ...append to the list the results of summarizing by that variable.
  results <- c(
    results,
    # Wrap summary in a `list` to append properly:
    list(
      df %>%
        # Interpret the variable name as the variable itself, within the context
        # of `df`; and count the occurrences of each of the values that variable
        # takes on within `df`.
        count(!!sym(nom)) %>%
        # Sum up the counts to reconstruct the total amount; then divide the
        # count `n` by that total, to obtain `sot`.
        mutate(sot = n/sum(n))
    ) %>%
      # Name that summary after the variable.
      setNames(nm = nom)
  )
}


# View results
results

Given your sample df reproduced here

structure(
  list(
    age       = c(45      , 21      , 32      , 33      , 46      ),
    gender    = c("female", "female", "male"  , "male"  , "female"),
    income    = c("low"   , "low"   , "medium", "high"  , "low"   ),
    education = c("high"  , "high"  , "high"  , "medium", "medium")
  ),
  class = "data.frame",
  row.names = c(NA, -5L)
)

this workflow should yield the following list of results:

$age
  age n sot
1  21 1 0.2
2  32 1 0.2
3  33 1 0.2
4  45 1 0.2
5  46 1 0.2

$gender
  gender n sot
1 female 3 0.6
2   male 2 0.4

$income
  income n sot
1   high 1 0.2
2    low 3 0.6
3 medium 1 0.2

$education
  education n sot
1      high 3 0.6
2    medium 2 0.4

My solution covers every variable in df, but feel free to exclude variables like age by modifying the for-loop.

To write all this as the file stat.csv, delimited by ; as in your code, simply finish with:

for(summr in results) {
  write.table(
    x = summr, 
    file = 'stat.csv',
    sep = ';',
    row.names = FALSE,
    append = TRUE
  )
}

Upvotes: 1

Maxime auffret
Maxime auffret

Reputation: 91

You should use 'count_()' instead of 'count()' it is the same function but it take variable instead of string in 'var'.

library(dplyr)

for (i in class) {
   df = count_(users, i)
   write.csv(df, row.names = T, file = paste0('Title_',i,'.txt'))
}

Upvotes: 1

koolmees
koolmees

Reputation: 2783

You can use sink() for this:

library(dplyr)
n_gen <- df %>% group_by(gender) %>% summarise(Feq = n(), sot = n()/nrow(df))
n_inc <- df %>% group_by(income) %>% summarise(Feq = n(), sot = n()/nrow(df))
n_edu <- df %>% group_by(education) %>% summarise(Feq = n(), sot = n()/nrow(df))

sink('export.csv')

write.csv(n_gen, row.names = F)
write.csv(n_inc, row.names = F)
write.csv(n_edu, row.names = F)

sink()

You could shorten it and write it in a for loop. Depending on how many columns you have (in df) that might be preferred

Upvotes: 1

Related Questions