Reputation: 2179
Sorry if this is too large of a example. It does looks more real life but I have a hard time thinking of an example that would better explain my situation.
What I want is a tidy data.frame where I can use the medical conditions in summaries (avg) and in plots (edited) What I Need Answered Am I trying to accomplish this correctly. Do I want row with a HUGE string with values divided up by commas? Do I need to split it into more columns?
Reports from our data base vendor (actual data changed). The reports do not give a unique key. In my data.frames the person.id is unique in some and others are like this with multi rows of person.id and values.
person.id <- c("1017", "1018", "1018", "1018", "1018", "1018", "1018",
"1018", "1018", "1018", "1018", "1019", "1019", "1020",
"1020")
med.condition <- c(NA, "Allergic rhinitis", "Allergic rhinitis",
"Atopic Dermatitis", "Atopic Dermatitis",
"Developmental Speech",
"Developmental Speech",
"Eye Condition", "Eye Condition", "Speech Delay",
"Speech Delay", "Allergic Reaction", NA, "Eczema",
"Obese")
cond.type <- c("Assessment", "Assessment", NA, "Assessment", NA, "Assessment",
NA, "Assessment", NA, "Assessment", NA, "Assessment",
"Assessment", "Assessment", "Assessment")
df <- data.frame(person.id, med.condition, cond.type)
Which looks like:
person.id med.condition cond.type
1 1017 NA Assessment
2 1018 Allergic rhinitis Assessment
3 1018 Allergic rhinitis NA
4 1018 Atopic Dermatitis Assessment
5 1018 Atopic Dermatitis NA
6 1018 Developmental Speech Assessment
7 1018 Developmental Speech NA
8 1018 Eye Condition Assessment
9 1018 Eye Condition NA
10 1018 Speech Delay Assessment
11 1018 Speech Delay NA
12 1019 Allergic Reaction Assessment
13 1019 NA Assessment
14 1020 Eczema Assessment
15 1020 Obese Assessment
I want rows to equal one person.id
Do I want to have it look like this (only first 5 columns shown): Used taplly which fails at tidy
condition1 condition2 condition3 condition4 condition5
1017 NA NA NA NA NA
1018 Allergic rhinitis Atopic Dermatitis Allergic Reaction Developmental Speech Eye Condition
1019 NA NA NA NA NA
1020 Eczema Obese NA NA NA
What do you do to make a dataset tidy?
med.condtion
1017 NA
1018 "Allergic rhinitis", "Atopic Dermatitis", "Developmental Speech", "Eye Condition", "Speech Delay", "Allergic Reaction"
1019 NA
1020 "Eczema" "Obese"
Or do I need to think of this in a new way?
What I tired tapply, reshape2
taplly Doesn't work on this example but does in my program Sorry
df2 <- data.frame(person.id, med.condition, cond.type)
df2.wide <- tapply(X = df2$medical.condition, INDEX = df2$person.id,
function(x){
ux <- unique(x)
c(ux, rep(x = NA, 9 - length (ux)))
})
df2.wide <- as.data.frame(do.call('rbind', df2.wide), stringsAsFactors = FALSE)
names(promis.b.temp) <- paste0('condition', 1:9)
cols <- names(promis.b.temp) df2$med.all <- apply( df2[, cols], 1, paste, collapse = ", ")
reshape2 Quickly realized that wouldn't work library(reshape2) df3 <- test %>% melt() %>% unique() %>% cast(person.id)
Upvotes: 1
Views: 189
Reputation: 193527
I don't really understand this question. Your data already appear to be "tidy".
The two things I notice are (1) duplicated values (which may or may not be wanted) and (2) the lack of a unique id per person and medical condition.
If you wanted a long string separated by commas (hard to deal with later, in my opinion), you can just aggregate by the unique values in the first two columns, like this:
library(data.table)
as.data.table(unique(df[1:2]))[, paste(med.condition, collapse = ","), by = person.id]
# person.id V1
# 1: 1017 NA
# 2: 1018 Allergic rhinitis,Atopic Dermatitis,Developmental Speech,Eye Condition,Speech Delay
# 3: 1019 Allergic Reaction,NA
# 4: 1020 Eczema,Obese
If you wanted to easily get a sequential ID for each person, you can use getanID
from my "splitstackshape" package:
library(splitstackshape)
getanID(as.data.table(unique(df[1:2]))
This would let you convert to the wide form, if so desired, by using dcast.data.table
, like this:
library(splitstackshape)
dcast.data.table(getanID(as.data.table(unique(df[1:2])), "person.id"),
person.id ~ .id, value.var = "med.condition",
fun.aggregate = function(x) paste(x, collapse = ","))
# person.id 1 2 3 4 5
# 1: 1017 NA
# 2: 1018 Allergic rhinitis Atopic Dermatitis Developmental Speech Eye Condition Speech Delay
# 3: 1019 Allergic Reaction NA
# 4: 1020 Eczema Obese
Upvotes: 2
Reputation: 574
Your dataframe is in what's called the "long" format and you want to reshape it to the "wide" format. try below:
require(reshape2)
df.new <- reshape(df,idvar='person.id',timevar='cond.type',direction='wide')
Upvotes: 1
Reputation: 206232
You can do this with the base reshape()
function if you just add a "time" indicator to each observation (which you can easily do with ave()
). If you run
reshape(
transform(
unique(df[, c("person.id","med.condition")]),
time=ave(as.numeric(person.id), person.id, FUN=seq_along)
),
idvar="person.id",
v.names="med.condition",
direction="wide")
you will get
person.id med.condition.1 med.condition.2 med.condition.3 med.condition.4 med.condition.5
1017 NA NA NA NA NA
1018 Allergic rhinitis Atopic Dermatitis Developmental Speech Eye Condition Speech Delay
1019 Allergic Reaction NA NA NA NA
1020 Eczema Obese NA NA NA
Upvotes: 1