psoares
psoares

Reputation: 4883

add columns based on repetions and conditions of other columns R

I'm a bit lost about the best way to perform what I want in R.

I have a dataframe with several columns. One of them is user and other column, let's say number (0,1,2,3,4,5). Sometimes the user is repeated but the column number should have a sequence of numbers in this case. Can be something like this:

user number  column B column C
1    0        85      200
2    1        165     852
1    2        200     1000
3    0        15      689
3    1        89      896
4    0        78      582
4    2        96      586 

What I would like to obtain is:

user number status days
1    0      0       200
3    0      1       50 ->(value column C, user 3, number 1)-(value column B, user 3, number 0)
4    0      0       200

I want to add two columns to the dataframe based on values from these users.

I'm asking this question because I'm a bit confused about the best way to do this. I though about subset the dataframe where number == 0 and then compare users between dataframes to see repeated users but perhaps is it not the best way to do this.

Any pointers and ideas would be great!

Thank you for your help

EDIT

columnB is the number of days that have passed since '2002-01-01'. For instance the first user enter in the dataframe 85 days after 2002-01-01

columnC is the number of days that the user stayed in observation. After those days the user is no longer in the database.

EDIT2

My dataset has 36 columns and 26075rows

After some suggestions where's what I have so far.

#remove from dataset all users that don't have number == 0
df1 <- df[df$user %in% df[df$number == 0,1],]

df1["Status"] <- 0

#doesn't work
df1[df1$number == 1 %in% df1[duplicated(df1$user),]]

What I was trying to do is to find all duplicated users and if the number is equal to 1. To try something like this afterwards:

 df1[df1$number == 1 %in% df1[duplicated(df1$user),]] <- df1$Status == 1

Upvotes: 0

Views: 371

Answers (2)

psoares
psoares

Reputation: 4883

 #order the dataset by columns needed
 df2 <- df2[order(df2$user, df2$number),]

 ids <- unique(df2$user)

 #create variable status and days
 status <- NULL
 end <- NULL
 for (i in 1:length(ids)) {
     aux <- df2[df2$user %in% ids[i],]
     if (nrow(aux) == 1 & aux$number[1] == 0){
          status[i] <- 0
          days[i] <- end
     } else if (nrow(aux) > 1) {
          status[i] <- 1
          days[i] <- aux$Event[2]
     } else {
          status[i] <- NA
          days[i] <- NA
     }
 }

 #merge the vector created with the database
 data <- data.frame(ids, status, days)
 df2 <- df2[!duplicated(df2$user),]
 df2 <- merge(df2, data, by.x="user", by.y="ids")

 #remove NAs from the dataset
 df.f <- df2[!(is.na(df2$days)),]

Upvotes: 1

Valentin Ruano
Valentin Ruano

Reputation: 2809

Perhaps subseting a data.frame per each user is not the most efficient way but it might be good enough for your data volume if you don't need an answer in just a few seconds.

ids = unique(df1$user);
status.days = sapply(ids,function(id) {
   udf = subset(df1,subset= user == id);
   if (nrow(udf) == 1) {
     status = 0;
     days = XXX; # the predefined value.
   } else if (udf$number[2] != 1) {
     status = 0;
     days = XXX; # the predefined value;
   } else {
     status = 1;
     days = udf$columnC[2] - udf$columnB[1];
   }
   c(status,days)
});

status = sapply(status.days,simplify=T,function(sd) { sd[1] });
days = sapply(status.days,simplify=T,function(sd) { sd[2] });

result = data.frame(user=ids,number=rep(0,length(ids)),status=status,days=days)

I have to confess that I am not sure about the indexing to extract status and days from the temporal list "status.days"... perhaps instead of sd[1] and sd[2] it should be sd[1][1] , sd[1][2] or sd[[1]][1], sd[[1]][2]

Upvotes: 1

Related Questions