Reputation: 5
My dataframe is as follows:
ds <- c("ab", "ab", "cd", "ab", NA, "gh")
member1age<-c(9, 21, 39, NA, 69, 90)
member1gender<-c(1,2,1,1,2,NA)
member2age<-c(15,30, 60, 21,12, 45)
member2gender<-c(1,2,2,1,2,1)
member3age<-c(17,2,90,NA,31,5)
member3gender<-c(1,1,2,NA, 1,2)
member4age<-c(NA, NA,23, NA,NA,NA)
member4gender<-c(NA,NA,1,NA,NA,NA)
df <- data.frame(ds,member1age,member1gender,member2age,member2gender,member3age,member3gender,member4age,member4gender)
I want to find out how many individuals,are there in each gender for the given age ranges in a perticular ds.
For example how many males between age 15-30 for ds- "ab". What i want as the output is something like the following; Table A:
Age Category Number of Males % Number of Females % Total Number %
Below 5 1 20.00 0 - 1 14.29
6-14 1 20.00 0 - 1 14.29
15-30 3 60.00 2 100.00 5 71.43
31-45 0 - 0 - 0 -
46-60 0 - 0 - 0 -
> 60 0 - 0 - 0 -
Total 5 100.00 2 100.00 7 100.00
what i have tried so far is
for each of memberage column applied the following;
forage1<-data.frame(ddply(df, .(ds), summarize,
All=length(ds),
of6Age=sum(Age1<6),
of6.15Age=sum(Age1>=6 & Age1<15),
of15.31Age=sum(Age1>=15 & Age1<31),
of31.46Age=sum(Age1>=31 & Age1<46),
of46.60Age=sum(Age1>=46 & Age1<=60),
of60Age=sum(Age1>60)))
And then merged the data frames together as follows; mylist<-list(forage1, forage2, forage3,forage4, forage5, forage6, forage7, forage8) f1<-ldply(mylist, rbind)
However this is not what i supposed to get.What i want is the Table A mentioned above. Please enlighten me with your kind support... Thank you in advance...!!!
Upvotes: 0
Views: 133
Reputation: 31171
You should first build your data in a proper way:
df <- data.frame(ds = rep(ds,4),
age = c(member1age,member2age,member3age, member4age),
gender = c(member1gender,member2gender,member3gender,member4gender))
Then use cut
to classify your population age:
df$AgeCategory = cut(df$age,
breaks=c(-1,5,14,30,45,60,Inf),
labels=c('Below 5','6-14','15-30','31-45','46-60','>60'),
right=T)
Finally summarise - here I use data.table
package:
library(data.table)
Then the data frame df has to be converted to a data table
df = setDT(df)
After that
dt = setkey(dt, AgeCategory, ds)
dt = dt[!is.na(ds) & !is.na(AgeCategory)] #not interested by NA in these columns
dt = dt[CJ(unique(AgeCategory),unique(ds)), allow.cartesian=T][is.na(gender), gender:=0]
dt[ds=='ab', list(MaleNumber=sum(gender==1), FemaleNumber=sum(gender==2)),AgeCategory][
,c('MaleRatio', 'FemaleRatio'):=list(MaleNumber/sum(MaleNumber),
FemaleNumber/sum(FemaleNumber))][]
# AgeCategory MaleNumber FemaleNumber MaleRatio FemaleRatio
#1: Below 5 1 0 0.2 0
#2: 6-14 1 0 0.2 0
#3: 15-30 3 2 0.6 1
#4: 31-45 0 0 0.0 0
#5: 46-60 0 0 0.0 0
#6: >60 0 0 0.0 0
Upvotes: 2
Reputation: 18487
First convert your df
into a long format
library(tidyr)
library(dplyr)
long <- df %>%
add_rownames("id") %>%
gather(complex, value, -ds, -id) %>%
extract(
complex,
c("member", "type"),
"([[:alpha:]]+[[:digit:]])([[:alpha:]]+)"
) %>%
spread(type, value)
Then use cut()
to split the age into groups
long %>%
mutate(
gender = factor(
gender,
levels = c(1, 2),
labels = c("male", "female")
),
fAge = cut(age, c(0, 5, 14, 30, 45, 60, 100))
) %>%
filter(!is.na(fAge), !is.na(gender)) %>%
group_by(fAge, gender) %>%
summarise(count = n()) %>%
spread(gender, count, fill = 0)
Upvotes: 0