Reputation: 218
I have a data frame where a unique subject ID is repeated twice for all participants. The data following seems to consist of one column for which the value is NA one of the entries, and one column where there is a value for one of the entries (though this is not certain and the method I use should account for the possibility of this not being true). Here is an example:
Name <- c("Jon", "Jon", "Maria", "Maria", "Tina", "Tina", "dan", 'dan', 'wen', 'wen')
a <- c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6)
b <- c(NA, 1, NA, 2, NA, 3, NA, 4, NA, 5)
c <- c(1, NA, 2, NA, 3, NA, 4, NA, 5, NA)
df <- data.frame(Name, a, b, c)
The solution I thought of so far consists of looping through all the unique IDs (in the above example, Names) and making separate dataframes for each of the entries. Something like this:
#Instantiate list of lists that will become dfs
firstdf <- c()
seconddf <- c()
#Loop through existing df by unique ID (Name) and create
# list containing values of 1 entry and list of the other
for (i in unique(df$Name)) {
innerlist1 <- c()
innerlist2 <- c()
for (x in c(1:length(df[df['Name'] == i]))) {
if (x%%2 == 1) {
# Takes one set of entries per ID
innerlist1 <- c(innerlist1, df[df['Name'] == i][x])
} else if (x%%2 == 0) {
# Takes other set of entries per ID
innerlist2 <- c(innerlist2, df[df['Name'] == i][x])
}
}
firstdf <- c(firstdf, list(innerlist1))
seconddf <- c(seconddf, list(innerlist2))
}
# Make dfs from lists
firstdf <- do.call(rbind.data.frame, firstdf)
names(firstdf) <- names(df)
seconddf <- do.call(rbind.data.frame, seconddf)
names(seconddf) <- names(df)
I would then proceed to combine the dfs by using something like merge, with by="Name"
.
My original dataset is large and this is not particularly efficient or elegant. Can anyone suggest improvements?
Upvotes: 3
Views: 625
Reputation: 218
BTW, for future readers, what I eventually ended up doing was taking each entry per ID by odd/even index and making two dataframes like so:
firstdf <- df[seq_len(nrow(df))%%2 == 1, ]
seconddf <- df[seq_len(nrow(df))%%2 == 0, ]
After this point it's just a question of dropping columns where all entries are NAs and then merging the dfs, whilst dealing with situations where both dfs have non-NA values in the same location accordingly (e.g., by taking the mean of the two values).
Some extra steps I also had to take in my real-life situation that the simplicity of this example does not capture include:
df <- df[order(df$Name), ]
rownames(df) <- NULL
#Using dplyr
library(dplyr)
df %>%
count(Name) %>%
filter(n!=2)
# Should return 0 rows
In the case that there were more than or less than two entries, I did the following:
more <- df %>%
count(name) %>%
filter(n>2)
df_more_than_two <- df[df$Name %in% more$Name]
# Change sign in filter function to < 2 for those with only one entry
I then made three dataframes (those with 1 entry, those with 2 entries, and those with 3 entries) but essentially performed the same steps
Upvotes: -1
Reputation: 41225
If there are multiple non-NA values per ID, you can concentrate them toString
. You can use the following code:
library(dplyr)
df %>%
group_by(Name) %>%
summarise_all(funs(toString(na.omit(.))))
Output:
# A tibble: 5 × 4
Name a b c
<chr> <chr> <chr> <chr>
1 dan 4, 4 4 4
2 Jon 1, 1 1 1
3 Maria 2, 2 2 2
4 Tina 3, 4 3 3
5 wen 5, 6 5 5
Upvotes: 1
Reputation: 5481
You can keep first value by group exluding NA
:
library(dplyr)
df %>%
group_by(Name) %>%
summarise(a = first(stats::na.omit(a)),
b = first(stats::na.omit(b)),
c = first(stats::na.omit(c)))
# A tibble: 5 x 4
Name a b c
<chr> <dbl> <dbl> <dbl>
1 dan 4 4 4
2 Jon 1 1 1
3 Maria 2 2 2
4 Tina 3 3 3
5 wen 5 5 5
Upvotes: 1