Reputation: 4883
I'm a bit lost about the best way to perform what I want in R.
I have a dataframe with several columns. One of them is user and other column, let's say number (0,1,2,3,4,5). Sometimes the user is repeated but the column number should have a sequence of numbers in this case. Can be something like this:
user number column B column C
1 0 85 200
2 1 165 852
1 2 200 1000
3 0 15 689
3 1 89 896
4 0 78 582
4 2 96 586
What I would like to obtain is:
user number status days
1 0 0 200
3 0 1 50 ->(value column C, user 3, number 1)-(value column B, user 3, number 0)
4 0 0 200
I want to add two columns to the dataframe based on values from these users.
I'm asking this question because I'm a bit confused about the best way to do this. I though about subset the dataframe where number == 0 and then compare users between dataframes to see repeated users but perhaps is it not the best way to do this.
Any pointers and ideas would be great!
Thank you for your help
EDIT
columnB is the number of days that have passed since '2002-01-01'. For instance the first user enter in the dataframe 85 days after 2002-01-01
columnC is the number of days that the user stayed in observation. After those days the user is no longer in the database.
EDIT2
My dataset has 36 columns and 26075rows
After some suggestions where's what I have so far.
#remove from dataset all users that don't have number == 0
df1 <- df[df$user %in% df[df$number == 0,1],]
df1["Status"] <- 0
#doesn't work
df1[df1$number == 1 %in% df1[duplicated(df1$user),]]
What I was trying to do is to find all duplicated users and if the number is equal to 1. To try something like this afterwards:
df1[df1$number == 1 %in% df1[duplicated(df1$user),]] <- df1$Status == 1
Upvotes: 0
Views: 371
Reputation: 4883
#order the dataset by columns needed
df2 <- df2[order(df2$user, df2$number),]
ids <- unique(df2$user)
#create variable status and days
status <- NULL
end <- NULL
for (i in 1:length(ids)) {
aux <- df2[df2$user %in% ids[i],]
if (nrow(aux) == 1 & aux$number[1] == 0){
status[i] <- 0
days[i] <- end
} else if (nrow(aux) > 1) {
status[i] <- 1
days[i] <- aux$Event[2]
} else {
status[i] <- NA
days[i] <- NA
}
}
#merge the vector created with the database
data <- data.frame(ids, status, days)
df2 <- df2[!duplicated(df2$user),]
df2 <- merge(df2, data, by.x="user", by.y="ids")
#remove NAs from the dataset
df.f <- df2[!(is.na(df2$days)),]
Upvotes: 1
Reputation: 2809
Perhaps subseting a data.frame per each user is not the most efficient way but it might be good enough for your data volume if you don't need an answer in just a few seconds.
ids = unique(df1$user);
status.days = sapply(ids,function(id) {
udf = subset(df1,subset= user == id);
if (nrow(udf) == 1) {
status = 0;
days = XXX; # the predefined value.
} else if (udf$number[2] != 1) {
status = 0;
days = XXX; # the predefined value;
} else {
status = 1;
days = udf$columnC[2] - udf$columnB[1];
}
c(status,days)
});
status = sapply(status.days,simplify=T,function(sd) { sd[1] });
days = sapply(status.days,simplify=T,function(sd) { sd[2] });
result = data.frame(user=ids,number=rep(0,length(ids)),status=status,days=days)
I have to confess that I am not sure about the indexing to extract status and days from the temporal list "status.days"... perhaps instead of sd[1] and sd[2] it should be sd[1][1] , sd[1][2] or sd[[1]][1], sd[[1]][2]
Upvotes: 1