Reputation: 89
I have a dataframe containing user data :
age = c(45, 21, 32, 33, 46)
gender = c('female', 'female', 'male', 'male', 'female')
income = c('low', 'low', 'medium', 'high', 'low')
education = c('high', 'high', 'high', 'medium', 'medium')
df = data.frame(age, gender ,income, education)
From this i would like to obtain a legible list with a count & share of total for every attribute that i then would append to a table / csv that should be rather legible for further use than be a functioning dataframe. For one attribute that would be something like this:
nusers = nrow(users)
df = count(users, gender)
df['sot']=df['n']/totuser
write.table(df,'stat.csv',sep=';', row.names = FALSE, append = T)
With the following result desired for multiple attributes:
gender,n,sot
female,10,0.526315789
male,9,0.473684211
income,Freq,sot
low,4,0.210526316
medium,10,0.526315789
high,5,0.263157895
education,Freq,sot
low,8,0.421052632
medium,1,0.052631579
high,10,0.526315789
My (not very proficient) attempts to put this into a loop failed. How would i best go about this ?
Upvotes: 0
Views: 354
Reputation: 3336
Here's a solution with the dplyr
package.
The actual code could theoretically be confined to just a single line
library(dplyr)
# ...
for(nom in names(df)) write.table(df %>% count(!!sym(nom)) %>% mutate(sot = n/sum(n)), 'stat.csv', sep = ';', row.names = FALSE, append = TRUE)
to yield the output file stat.csv
"age";"n";"sot"
21;1;0.2
32;1;0.2
33;1;0.2
45;1;0.2
46;1;0.2
"gender";"n";"sot"
"female";3;0.6
"male";2;0.4
"income";"n";"sot"
"high";1;0.2
"low";3;0.6
"medium";1;0.2
"education";"n";"sot"
"high";3;0.6
"medium";2;0.4
but I've chosen to break up the workflow, with comments for clarity:
library(dplyr)
# ...
# Code to generate `df`
# ...
# Create list to accumulate the summaries
results <- list()
# For each variable (by name) in `df`...
for(nom in names(df)) {
# ...append to the list the results of summarizing by that variable.
results <- c(
results,
# Wrap summary in a `list` to append properly:
list(
df %>%
# Interpret the variable name as the variable itself, within the context
# of `df`; and count the occurrences of each of the values that variable
# takes on within `df`.
count(!!sym(nom)) %>%
# Sum up the counts to reconstruct the total amount; then divide the
# count `n` by that total, to obtain `sot`.
mutate(sot = n/sum(n))
) %>%
# Name that summary after the variable.
setNames(nm = nom)
)
}
# View results
results
Given your sample df
reproduced here
structure(
list(
age = c(45 , 21 , 32 , 33 , 46 ),
gender = c("female", "female", "male" , "male" , "female"),
income = c("low" , "low" , "medium", "high" , "low" ),
education = c("high" , "high" , "high" , "medium", "medium")
),
class = "data.frame",
row.names = c(NA, -5L)
)
this workflow should yield the following list
of results
:
$age
age n sot
1 21 1 0.2
2 32 1 0.2
3 33 1 0.2
4 45 1 0.2
5 46 1 0.2
$gender
gender n sot
1 female 3 0.6
2 male 2 0.4
$income
income n sot
1 high 1 0.2
2 low 3 0.6
3 medium 1 0.2
$education
education n sot
1 high 3 0.6
2 medium 2 0.4
My solution covers every variable in df
, but feel free to exclude variables like age
by modifying the for
-loop.
To write all this as the file stat.csv
, delimited by ;
as in your code, simply finish with:
for(summr in results) {
write.table(
x = summr,
file = 'stat.csv',
sep = ';',
row.names = FALSE,
append = TRUE
)
}
Upvotes: 1
Reputation: 91
You should use 'count_()' instead of 'count()' it is the same function but it take variable instead of string in 'var'.
library(dplyr)
for (i in class) {
df = count_(users, i)
write.csv(df, row.names = T, file = paste0('Title_',i,'.txt'))
}
Upvotes: 1
Reputation: 2783
You can use sink()
for this:
library(dplyr)
n_gen <- df %>% group_by(gender) %>% summarise(Feq = n(), sot = n()/nrow(df))
n_inc <- df %>% group_by(income) %>% summarise(Feq = n(), sot = n()/nrow(df))
n_edu <- df %>% group_by(education) %>% summarise(Feq = n(), sot = n()/nrow(df))
sink('export.csv')
write.csv(n_gen, row.names = F)
write.csv(n_inc, row.names = F)
write.csv(n_edu, row.names = F)
sink()
You could shorten it and write it in a for loop. Depending on how many columns you have (in df) that might be preferred
Upvotes: 1