micheal6126
micheal6126

Reputation: 13

R modify value in one column if content in another column contains string

I am working with a large data set where I want to determine if a column containing text strings passes a logical test where I can subset it out later. Currently, I'm trying to build the test column for each row. The test is based on if the cell contains less than 2 relevant characters, but I want to keep the full set of characters in the actual cell. The following is a simplified example of what I am trying to do:

Suppose I have the following data frame:

df <- data.frame(matrix(NA, nrow = 5, ncol = 1))
colnames(df) <- "test"
df$test <- c("one", "two", "three", "one", "onetwo")
df$hyp <- ("two", "one", "onetwo", "one", "two")
df$testcount <- sapply(df$test, str_length)
df$hypcount <- sapply(df$hyp, str_length)
df

    test    hyp testcount hypcount
1    one    two         3        3
2    two    one         3        3
3  three onetwo         5        6
4    one    one         3        3
5 onetwo    two         6        3

I want to identify a text string, like "two", and if a row in the test column or hyp column (depending on the test I'm running. I don't want this to run on both columns) contains the string (though is not identical to the string), then I want that same row to subtract the number of characters in the string I identified from the testcount or hypcount columns.

For example, if I run this function on the text string "two" in the test column, then I should get the following output:

    test    hyp testcount hypcount
1    one    two         3        3
2    two    one         0        3
3  three onetwo         5        6
4    one    one         3        3
5 onetwo    two         3        3

And if I were to run this on the hyp column, then I should get the following output:

    test    hyp testcount hypcount
1    one    two         3        0
2    two    one         3        3
3  three onetwo         5        3
4    one    one         3        3
5 onetwo    two         6        0

I have tried three approaches. First I tried using an if function to run the replacement conditionally (on this test I tested for the string "one" instead of "two"):

if(grepl("one", df$test)) {
  df[which(grepl("one", df$test)), ]$testcount = df[which(grepl("one", df$test)), ]$testcount - 3
  }

But this returns the warning: "In if (grepl("one", df$test)) { : the condition has length > 1 and only the first element will be used"

This leads to a correct substitution for the string "one", but not the string "two". Further, if I substitute for string "two" in the hyp column, the function works, but not if I run the substitution for string "one". I suspect this is because it only runs the test on the first row, and if it is true, then it checks the entire data frame.

Next I tried running the function within an lapply function:

df <- data.frame(lapply(df, function(x) {
  if(grepl("one", df$test)) {
    df[which(grepl("one", df$test)), ]$testcount = df[which(grepl("one", df$test)), ]$testcount - 3
  }}))

This also didn't work, though I don't fully understand why. Somehow it ends up returning the output:

  test hyp testcount hypcount
1    0   0         0        0
2    0   0         0        0
3    3   3         3        3

Lastly, I tried running it as an ifelse operation (here I switched to replacing for string "two" so I don't mistakenly think the function works on all rows):

df$testcount <- ifelse(grepl("two", df$test), (df[which(grepl("two", df$test)), ]$testcount = df[which(grepl("two", df$test)), ]$testcount - 3))

The strange thing is, this worked when I first applied it a few days ago. I tested it for the strings "two", "on", and "one", and it worked properly. Now, when I got around to applying it to my actual data, it didn't work. Further, when I went back to the testing to see what went wrong, it no longer is working. It simply returns the error: "Error in ifelse(grepl("two", df$test), (df[which(grepl("two", df$test)), : argument "no" is missing, with no default"

I have tried two solutions to this. First, I tried adding a statement into the "no" argument that would have no impact on my data:

 df$testcount <- ifelse(grepl("two", df$test), (df[which(grepl("two", df$test)), ]$testcount = df[which(grepl("two", df$test)), ]$testcount - 3), T)

However, this causes it to return the output:

    test    hyp testcount hypcount
1    one    two         1        3
2    two    one         3        3
3  three onetwo         1        6
4    one    one         1        3
5 onetwo    two         0        3

Next I tried substituting a meaningful "no" argument:

 df$testcount <- ifelse(grepl("two", df$test), (df[which(grepl("two", df$test)), ]$testcount = df[which(grepl("two", df$test)), ]$testcount - 3), (df[which(grepl("two", df$test)), ]$testcount = df[which(grepl("two", df$test)), ]$testcount))

But now it returns the output:

    test    hyp testcount hypcount
1    one    two        -3        3
2    two    one         0        3
3  three onetwo        -3        6
4    one    one         0        3
5 onetwo    two        -3        3

I don't understand this output.

My question is, can anyone help me to understand why this is not working, and offer a solution? Thanks in advance!

Upvotes: 1

Views: 4184

Answers (2)

neilfws
neilfws

Reputation: 33772

Try this function:

subtract_match <- function(column1, column2, text, df) {
  df2 <- df
  df2[, column2] <- ifelse(grepl(text, df[, column1]), 
                           df[, column2] - nchar(text), 
                           df[, column2])
  df2
}

subtract_match("test", "testcount", "two", df1)

    test    hyp testcount hypcount
1    one    two         3        3
2    two    one         0        3
3  three onetwo         5        6
4    one    one         3        3
5 onetwo    two         3        3

subtract_match("hyp", "hypcount", "two", df1)

    test    hyp testcount hypcount
1    one    two         3        0
2    two    one         3        3
3  three onetwo         5        3
4    one    one         3        3
5 onetwo    two         6        0

Upvotes: 2

Maurits Evers
Maurits Evers

Reputation: 50668

I'm not sure I completely understand your problem, but the following reproduces your expected outcome for the two test cases.

# The first argument is the column where you want to find id
# I'm unsure about what you want to subtract; subtracting the entry from 
# the count columns corresponds to setting the entry to 0
some_function <- function(col, id, df) {
    idx <- which(colnames(df) == col);
    df[df[, idx] == id, idx + 2] <- 0;
    return(df);
}

some_function("test", "two", df);
#    test    hyp testcount hypcount
#1    one    two         3        3
#2    two    one         0        3
#3  three onetwo         5        6
#4    one    one         3        3
#5 onetwo    two         6        3

some_function("hyp", "two", df)
#    test    hyp testcount hypcount
#1    one    two         3        0
#2    two    one         3        3
#3  three onetwo         5        6
#4    one    one         3        3
#5 onetwo    two         6        0

Also, you say "contains the string (though is not identical to the string)", but for entries containing "onetwo" you do not subtract the count values. So it is complete matches that you are after?

Upvotes: 0

Related Questions