Reputation: 13
I am working with a large data set where I want to determine if a column containing text strings passes a logical test where I can subset it out later. Currently, I'm trying to build the test column for each row. The test is based on if the cell contains less than 2 relevant characters, but I want to keep the full set of characters in the actual cell. The following is a simplified example of what I am trying to do:
Suppose I have the following data frame:
df <- data.frame(matrix(NA, nrow = 5, ncol = 1))
colnames(df) <- "test"
df$test <- c("one", "two", "three", "one", "onetwo")
df$hyp <- ("two", "one", "onetwo", "one", "two")
df$testcount <- sapply(df$test, str_length)
df$hypcount <- sapply(df$hyp, str_length)
df
test hyp testcount hypcount
1 one two 3 3
2 two one 3 3
3 three onetwo 5 6
4 one one 3 3
5 onetwo two 6 3
I want to identify a text string, like "two", and if a row in the test
column or hyp
column (depending on the test I'm running. I don't want this to run on both columns) contains the string (though is not identical to the string), then I want that same row to subtract the number of characters in the string I identified from the testcount
or hypcount
columns.
For example, if I run this function on the text string "two" in the test
column, then I should get the following output:
test hyp testcount hypcount
1 one two 3 3
2 two one 0 3
3 three onetwo 5 6
4 one one 3 3
5 onetwo two 3 3
And if I were to run this on the hyp
column, then I should get the following output:
test hyp testcount hypcount
1 one two 3 0
2 two one 3 3
3 three onetwo 5 3
4 one one 3 3
5 onetwo two 6 0
I have tried three approaches. First I tried using an if function to run the replacement conditionally (on this test I tested for the string "one" instead of "two"):
if(grepl("one", df$test)) {
df[which(grepl("one", df$test)), ]$testcount = df[which(grepl("one", df$test)), ]$testcount - 3
}
But this returns the warning: "In if (grepl("one", df$test)) { : the condition has length > 1 and only the first element will be used"
This leads to a correct substitution for the string "one", but not the string "two". Further, if I substitute for string "two" in the hyp
column, the function works, but not if I run the substitution for string "one". I suspect this is because it only runs the test on the first row, and if it is true, then it checks the entire data frame.
Next I tried running the function within an lapply function:
df <- data.frame(lapply(df, function(x) {
if(grepl("one", df$test)) {
df[which(grepl("one", df$test)), ]$testcount = df[which(grepl("one", df$test)), ]$testcount - 3
}}))
This also didn't work, though I don't fully understand why. Somehow it ends up returning the output:
test hyp testcount hypcount
1 0 0 0 0
2 0 0 0 0
3 3 3 3 3
Lastly, I tried running it as an ifelse operation (here I switched to replacing for string "two" so I don't mistakenly think the function works on all rows):
df$testcount <- ifelse(grepl("two", df$test), (df[which(grepl("two", df$test)), ]$testcount = df[which(grepl("two", df$test)), ]$testcount - 3))
The strange thing is, this worked when I first applied it a few days ago. I tested it for the strings "two", "on", and "one", and it worked properly. Now, when I got around to applying it to my actual data, it didn't work. Further, when I went back to the testing to see what went wrong, it no longer is working. It simply returns the error: "Error in ifelse(grepl("two", df$test), (df[which(grepl("two", df$test)), : argument "no" is missing, with no default"
I have tried two solutions to this. First, I tried adding a statement into the "no" argument that would have no impact on my data:
df$testcount <- ifelse(grepl("two", df$test), (df[which(grepl("two", df$test)), ]$testcount = df[which(grepl("two", df$test)), ]$testcount - 3), T)
However, this causes it to return the output:
test hyp testcount hypcount
1 one two 1 3
2 two one 3 3
3 three onetwo 1 6
4 one one 1 3
5 onetwo two 0 3
Next I tried substituting a meaningful "no" argument:
df$testcount <- ifelse(grepl("two", df$test), (df[which(grepl("two", df$test)), ]$testcount = df[which(grepl("two", df$test)), ]$testcount - 3), (df[which(grepl("two", df$test)), ]$testcount = df[which(grepl("two", df$test)), ]$testcount))
But now it returns the output:
test hyp testcount hypcount
1 one two -3 3
2 two one 0 3
3 three onetwo -3 6
4 one one 0 3
5 onetwo two -3 3
I don't understand this output.
My question is, can anyone help me to understand why this is not working, and offer a solution? Thanks in advance!
Upvotes: 1
Views: 4184
Reputation: 33772
Try this function:
subtract_match <- function(column1, column2, text, df) {
df2 <- df
df2[, column2] <- ifelse(grepl(text, df[, column1]),
df[, column2] - nchar(text),
df[, column2])
df2
}
subtract_match("test", "testcount", "two", df1)
test hyp testcount hypcount
1 one two 3 3
2 two one 0 3
3 three onetwo 5 6
4 one one 3 3
5 onetwo two 3 3
subtract_match("hyp", "hypcount", "two", df1)
test hyp testcount hypcount
1 one two 3 0
2 two one 3 3
3 three onetwo 5 3
4 one one 3 3
5 onetwo two 6 0
Upvotes: 2
Reputation: 50668
I'm not sure I completely understand your problem, but the following reproduces your expected outcome for the two test cases.
# The first argument is the column where you want to find id
# I'm unsure about what you want to subtract; subtracting the entry from
# the count columns corresponds to setting the entry to 0
some_function <- function(col, id, df) {
idx <- which(colnames(df) == col);
df[df[, idx] == id, idx + 2] <- 0;
return(df);
}
some_function("test", "two", df);
# test hyp testcount hypcount
#1 one two 3 3
#2 two one 0 3
#3 three onetwo 5 6
#4 one one 3 3
#5 onetwo two 6 3
some_function("hyp", "two", df)
# test hyp testcount hypcount
#1 one two 3 0
#2 two one 3 3
#3 three onetwo 5 6
#4 one one 3 3
#5 onetwo two 6 0
Also, you say "contains the string (though is not identical to the string)", but for entries containing "onetwo" you do not subtract the count values. So it is complete matches that you are after?
Upvotes: 0