Reputation: 2179

How to create tidy data with a Dataset where values are duplicated over many rows

Sorry if this is too large of a example. It does looks more real life but I have a hard time thinking of an example that would better explain my situation.

What I want is a tidy data.frame where I can use the medical conditions in summaries (avg) and in plots (edited) What I Need Answered Am I trying to accomplish this correctly. Do I want row with a HUGE string with values divided up by commas? Do I need to split it into more columns?

Reports from our data base vendor (actual data changed). The reports do not give a unique key. In my data.frames the person.id is unique in some and others are like this with multi rows of person.id and values.

person.id <- c("1017", "1018", "1018", "1018", "1018", "1018", "1018",
               "1018", "1018", "1018", "1018", "1019", "1019", "1020",
               "1020")
med.condition <- c(NA, "Allergic rhinitis", "Allergic rhinitis",
                   "Atopic Dermatitis", "Atopic Dermatitis",
                   "Developmental Speech",
                   "Developmental Speech",
                   "Eye Condition", "Eye Condition", "Speech Delay",
                   "Speech Delay", "Allergic Reaction", NA, "Eczema",
                   "Obese")
cond.type <- c("Assessment", "Assessment", NA, "Assessment", NA, "Assessment",
               NA, "Assessment", NA, "Assessment", NA, "Assessment",
               "Assessment", "Assessment", "Assessment")
df <- data.frame(person.id, med.condition, cond.type)

Which looks like:

  person.id  med.condition                              cond.type
1   1017    NA                                          Assessment
2   1018    Allergic rhinitis                           Assessment
3   1018    Allergic rhinitis                           NA
4   1018    Atopic Dermatitis                           Assessment
5   1018    Atopic Dermatitis                           NA
6   1018    Developmental Speech                        Assessment
7   1018    Developmental Speech                        NA
8   1018    Eye Condition                               Assessment
9   1018    Eye Condition                               NA
10  1018    Speech Delay                                Assessment
11  1018    Speech Delay                                NA
12  1019    Allergic Reaction                           Assessment
13  1019    NA                                          Assessment
14  1020    Eczema                                      Assessment
15  1020    Obese                                       Assessment

I want rows to equal one person.id

Do I want to have it look like this (only first 5 columns shown): Used taplly which fails at tidy

    condition1         condition2        condition3        condition4           condition5
1017 NA                 NA                NA                NA                   NA
1018 Allergic rhinitis Atopic Dermatitis  Allergic Reaction Developmental Speech Eye Condition
1019 NA                 NA                NA                NA                   NA
1020 Eczema             Obese             NA                NA                   NA

What do you do to make a dataset tidy?

     med.condtion
1017 NA
1018 "Allergic rhinitis", "Atopic Dermatitis", "Developmental Speech", "Eye Condition", "Speech Delay", "Allergic Reaction" 
1019 NA
1020 "Eczema" "Obese"

Or do I need to think of this in a new way?

What I tired tapply, reshape2

taplly Doesn't work on this example but does in my program Sorry

df2 <- data.frame(person.id, med.condition, cond.type)
df2.wide <- tapply(X = df2$medical.condition, INDEX = df2$person.id,
                        function(x){
                          ux <- unique(x)
                          c(ux, rep(x = NA, 9 - length (ux)))
                        })
df2.wide <- as.data.frame(do.call('rbind', df2.wide), stringsAsFactors = FALSE)
names(promis.b.temp) <- paste0('condition', 1:9)

cols <- names(promis.b.temp) df2$med.all <- apply( df2[, cols], 1, paste, collapse = ", ")

reshape2 Quickly realized that wouldn't work library(reshape2) df3 <- test %>% melt() %>% unique() %>% cast(person.id)

Am I tackling this issue correctly?
Will I have problems when I do my reports with having to do filters with the strings?

Upvotes: 1

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

I don't really understand this question. Your data already appear to be "tidy".

The two things I notice are (1) duplicated values (which may or may not be wanted) and (2) the lack of a unique id per person and medical condition.

If you wanted a long string separated by commas (hard to deal with later, in my opinion), you can just aggregate by the unique values in the first two columns, like this:

library(data.table)
as.data.table(unique(df[1:2]))[, paste(med.condition, collapse = ","), by = person.id]
#    person.id                                                                                  V1
# 1:      1017                                                                                  NA
# 2:      1018 Allergic rhinitis,Atopic Dermatitis,Developmental Speech,Eye Condition,Speech Delay
# 3:      1019                                                                Allergic Reaction,NA
# 4:      1020                                                                        Eczema,Obese

If you wanted to easily get a sequential ID for each person, you can use getanID from my "splitstackshape" package:

library(splitstackshape)
getanID(as.data.table(unique(df[1:2]))

This would let you convert to the wide form, if so desired, by using dcast.data.table, like this:

library(splitstackshape)
dcast.data.table(getanID(as.data.table(unique(df[1:2])), "person.id"), 
                 person.id ~ .id, value.var = "med.condition", 
                 fun.aggregate = function(x) paste(x, collapse = ","))
#    person.id                 1                 2                    3             4            5
# 1:      1017                NA                                                                  
# 2:      1018 Allergic rhinitis Atopic Dermatitis Developmental Speech Eye Condition Speech Delay
# 3:      1019 Allergic Reaction                NA                                                
# 4:      1020            Eczema             Obese

Upvotes: 2

Bahae Omid

Reputation: 574

Your dataframe is in what's called the "long" format and you want to reshape it to the "wide" format. try below:

require(reshape2)
df.new <- reshape(df,idvar='person.id',timevar='cond.type',direction='wide')

Upvotes: 1

MrFlick

Reputation: 206232

You can do this with the base reshape() function if you just add a "time" indicator to each observation (which you can easily do with ave()). If you run

reshape(
    transform(
        unique(df[, c("person.id","med.condition")]), 
        time=ave(as.numeric(person.id), person.id, FUN=seq_along)
    ), 
    idvar="person.id", 
    v.names="med.condition",
    direction="wide")

you will get

person.id   med.condition.1 med.condition.2 med.condition.3 med.condition.4 med.condition.5
1017    NA  NA  NA  NA  NA
1018    Allergic rhinitis   Atopic Dermatitis   Developmental Speech    Eye Condition   Speech Delay
1019    Allergic Reaction   NA  NA  NA  NA
1020    Eczema  Obese   NA  NA  NA

Upvotes: 1

How to create tidy data with a Dataset where values are duplicated over many rows

Answers (3)

Related Questions