Reputation: 1126
My dataset is organized as shown below (just a small extract) : for a given subject (here, subject=5), I have 3 tests performed at times D-1, D1-8h and D2-24h:
SUBJECT TIME TEST RESULT UNITS RANGES
591 5 D-1 Leukoyte count urine 1 /?L |-< 15|-
592 5 D-1 Erythrocyte count urine 0 /?L |-< 19|-
593 5 D-1 Glucose dipstick urine Normal None |+ from 50 mg/dL-|-
684 5 D1 8h Leukoyte count urine 0 /?L |-< 15|-
687 5 D1 8h Erythrocyte count urine 0 /?L |-< 19|-
683 5 D1 8h Glucose dipstick urine Normal None |+ from 50 mg/dL-|-
694 5 D2 24h Leukoyte count urine 1 /?L |-< 15|-
695 5 D2 24h Erythrocyte count urine 0 /?L |-< 19|-
696 5 D2 24h Glucose dipstick urine Normal None |+ from 50 mg/dL-|-
I would like to reorganize these data in a table set by columns, in the following form:
TEST D-1 D1-8h D2-24h UNITS RANGES
so that I get one line by test.
I am confused with "table" and "aggregate", and I don't find a proper way to do that, although I am sure it's not so complicated...
Could you gve me some help?
Thanks
Here is dput:
> dput(dataset)
structure(list(SUBJECT = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L
), TIME = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("D-1",
"D1 8h", "D2 24h", "D4 72h"), class = "factor"), TEST = structure(c(35L,
24L, 28L, 35L, 24L, 28L, 35L, 24L, 28L), .Label = c("", "Alkaline phosphatase",
"APTT", "Basophils", "Basophils (%)", "Calcium", "CD19", "CD19 abs.",
"CD3", "CD3 abs.", "CD4/CD8 ratio", "CD4+", "CD4+ abs.", "CD56",
"CD56 absolute", "CD8+", "CD8+ abs.", "Chloride", "CK (creatine kinase)",
"Creatinine", "Direct bilirubin (conjug)", "Eosinophils", "Eosinophils (%)",
"Erythrocyte count urine", "Erythrocyte dipstick urine", "Gamma GT",
"Glucose", "Glucose dipstick urine", "GOT (AST)", "GPT (ALT)",
"Hematocrit", "Hemoglobin", "Ketone bodies urine", "Leukocyte esterase urine",
"Leukoyte count urine", "Lymphocytes", "Lymphocytes (%)", "Monocytes",
"Monocytes (%)", "Neutrophils", "Neutrophils (%)", "pH urine",
"Platelet count", "Potassium", "Protein urine", "PT INR", "Red blood cell count",
"Reticulocytes", "Reticulocytes %", "Serum Albumine", "Sodium",
"Total bilirubin", "Total cholesterol", "Total protein", "Triglycerides",
"Urea", "Urine glucose quantitative", "Urine protein quantitative",
"White blood cell count"), class = "factor"), RESULT = c("1",
"0", "Normal", "0", "0", "Normal", "1", "0", "Normal"), UNITS = c("/?L",
"/?L", "None", "/?L", "/?L", "None", "/?L", "/?L", "None"), RANGES = c("|-< 15|-",
"|-< 19|-", "|+ from 50 mg/dL-|-", "|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-",
"|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-")), .Names = c("SUBJECT",
"TIME", "TEST", "RESULT", "UNITS", "RANGES"), row.names = c(591L,
592L, 593L, 684L, 687L, 683L, 694L, 695L, 696L), class = "data.frame")
Upvotes: 0
Views: 69
Reputation: 69151
I believe you are asking for a pretty straight forward implementation of dcast()
which takes data from long to wide. Here's one implementation using data.table
package.
library(data.table)
#> Warning: package 'data.table' was built under R version 3.4.4
x <- structure(list(SUBJECT = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L
), TIME = c("D-1", "D-1", "D-1", "D1 8h", "D1 8h", "D1 8h", "D2 24h",
"D2 24h", "D2 24h"), TEST = c("Leukoyte count urine", "Erythrocyte count urine",
"Glucose dipstick urine", "Leukoyte count urine", "Erythrocyte count urine",
"Glucose dipstick urine", "Leukoyte count urine", "Erythrocyte count urine",
"Glucose dipstick urine"), RESULT = c("1", "0", "Normal", "0",
"0", "Normal", "1", "0", "Normal"), UNITS = c("/?L", "/?L", "None",
"/?L", "/?L", "None", "/?L", "/?L", "None"), RANGES = c("|-< 15|-",
"|-< 19|-", "|+ from 50 mg/dL-|-", "|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-",
"|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-")), .Names = c("SUBJECT",
"TIME", "TEST", "RESULT", "UNITS", "RANGES"), row.names = c(NA,
-9L), class = c("data.table", "data.frame"))
dcast(SUBJECT + TEST ~ TIME, data = x, value.var = c("UNITS", "RANGES"))
#> SUBJECT TEST UNITS_D-1 UNITS_D1 8h UNITS_D2 24h
#> 1: 5 Erythrocyte count urine /?L /?L /?L
#> 2: 5 Glucose dipstick urine None None None
#> 3: 5 Leukoyte count urine /?L /?L /?L
#> RANGES_D-1 RANGES_D1 8h RANGES_D2 24h
#> 1: |-< 19|- |-< 19|- |-< 19|-
#> 2: |+ from 50 mg/dL-|- |+ from 50 mg/dL-|- |+ from 50 mg/dL-|-
#> 3: |-< 15|- |-< 15|- |-< 15|-
Created on 2019-02-23 by the reprex package (v0.2.1)
Maybe this is what you want (if not, put the expected output in your question to avoid everyone guessing):
dcast(SUBJECT + TEST + UNITS + RANGES ~ TIME, data = df, value.var = "RESULT")
SUBJECT TEST UNITS RANGES D-1 D1 8h D2 24h
1 5 Erythrocyte count urine /?L |-< 19|- 0 0 0
2 5 Glucose dipstick urine None |+ from 50 mg/dL-|- Normal Normal Normal
3 5 Leukoyte count urine /?L |-< 15|- 1 0 1
Upvotes: 0
Reputation: 76402
Is this it? If so I believe it should be marked as a duplicate of reshape data from long to wide in R.
library(tidyverse)
spread(dataset, key = TIME, value = UNITS)
# SUBJECT TEST RESULT RANGES D-1 D1 8h D2 24h
#1 5 Erythrocyte count urine 0 |-< 19|- /?L /?L /?L
#2 5 Glucose dipstick urine Normal |+ from 50 mg/dL-|- None None None
#3 5 Leukoyte count urine 0 |-< 15|- <NA> /?L <NA>
#4 5 Leukoyte count urine 1 |-< 15|- /?L <NA> /?L
Edit.
In his comment, Peter_Evan corrects the above. Th right solution is
spread(dataset, key = TIME, value = RESULT)
# SUBJECT TEST UNITS RANGES D-1 D1 8h D2 24h
#1 5 Erythrocyte count urine /?L |-< 19|- 0 0 0
#2 5 Glucose dipstick urine None |+ from 50 mg/dL-|- Normal Normal Normal
#3 5 Leukoyte count urine /?L |-< 15|- 1 0 1
or, if the OP wants to reorder the columns, the following.
dataset %>%
spread(key = TIME, value = RESULT) %>%
select(SUBJECT,TEST, `D-1`:`D2 24h`, UNITS, RANGES)
# SUBJECT TEST D-1 D1 8h D2 24h UNITS RANGES
#1 5 Erythrocyte count urine 0 0 0 /?L |-< 19|-
#2 5 Glucose dipstick urine Normal Normal Normal None |+ from 50 mg/dL-|-
#3 5 Leukoyte count urine 1 0 1 /?L |-< 15|-
Upvotes: 1