Reputation: 325
I need to create some new columns that count 1 if a sub-string appears one or more times in a string column. Like this:
Existing Column New Col (True if apple) New Col (True if banana)
apple, apple, orange 1 0
banana, banana, orange 0 1
apple, banana, orange 1 1
Upvotes: -1
Views: 1224
Reputation: 43334
So I thought you wanted columns of counts (not whether strings are contained) the first time I read the question (the previous edit), but it's sort of useful code anyway, so I left it. Here are options for both base R and the stringr
package:
First let's make a sample data.frame with similar data
# stringsAsFactors = FALSE would be smart here, but let's not assume...
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))
which looks like
> df
x
1 a, b, c, a
2 b, b, c
3 d, a
Use strsplit
to make a list of vectors of separated strings, using as.character
to coerce factors to a useful form,
list <- strsplit(as.character(df$x), ', ')
then make a list of unique strings
lvls <- unique(unlist(list))
Loop over the rows of the data.frame/list with sapply
. (All sapply
functions in this answer could be replaced with for
loops, but that's generally considered poor style in R for speed reasons.) Test if the unique strings are in each, and change to integer format. Set the result (t
ransposed) to a new column of df
, one for each unique string.
df[, lvls] <- t(sapply(1:nrow(df), function(z){as.integer(lvls %in% list[[z]])}))
> df
x a b c d
1 a, b, c, a 1 1 1 0
2 b, b, c 0 1 1 0
3 d, a 1 0 0 1
To keep values as Boolean TRUE
/FALSE
instead of integers, just remove as.integer
.
Loop over the rows of the data.frame/list with the outside sapply
, while the inner one loops over the unique strings in each, and counts the occurrences by summing TRUE
values. Set the result (t
ransposed) to a new column of df
, one for each unique string.
df[, lvls] <- t(sapply(1:nrow(df), function(z){
sapply(seq_along(lvls), function(y){sum(lvls[y] == list[[z]])})
}))
> df
x a b c d
1 a, b, c, a 2 1 1 0
2 b, b, c 0 2 1 0
3 d, a 1 0 0 1
stringr
stringr
can make these tasks much more straightforward.
First, find unique strings in df$x
. Split strings with str_split
(which can take a factor), flatten them into a vector with unlist
, and find unique ones:
library(stringr)
lvls <- unique(unlist(str_split(df$x, ', ')))
str_detect
allows us to only loop over the unique strings, not rows:
df[, lvls] <- sapply(lvls, function(y){as.integer(str_detect(df$x, y))})
str_count
simplifies our syntax dramatically, again only looping over lvls
:
df[,lvls] <- sapply(lvls, function(y){str_count(df$x, y)})
Results for both are identical to those in base R above.
Upvotes: 3
Reputation: 193497
Using "df" from @user3949008's answer, you can also try cSplit_e
from my "splitstackshape" package:
library(splitstackshape)
cSplit_e(df, "Fruit", ",", type = "character", fill = 0)
# Fruit Fruit_apple Fruit_banana Fruit_orange Fruit_pear
# 1 apple,orange,orange 1 0 1 0
# 2 banana,banana,pear 0 1 0 1
# 3 apple,banana,orange 1 1 1 0
You can always drop the columns you're not interested in later.
If you're after counts, you can try mtabulate
from "qdapTools":
library(qdapTools)
mtabulate(strsplit(df$Fruit, ","))
# apple banana orange pear
# 1 1 0 2 0
# 2 0 2 0 1
# 3 1 1 1 0
Upvotes: 2
Reputation: 10473
So, without full details, it is very hard to know exactly what you are looking for. But, if you are looking for number of times a given string occurs and adding as a column to the original data, here is one approach that works (replicating your data input):
df <- data.frame(Fruit = c('apple,orange,orange', 'banana,banana,pear', 'apple,banana,orange'), stringsAsFactors = FALSE)
df$appleCount <- lapply(strsplit(df$Fruit, ','), function(x) sum('apple' == x))
df$bananaCount <- lapply(strsplit(df$Fruit, ','), function(x) sum('banana' == x))
This will only work when you know the specific strings you identified to add as columns. But, should give you an idea of how to split strings, count how many of a given are in that split list, etc. Hope this helps.
Output from the above code should be this:
Fruit appleCount bananaCount
1 apple,orange,orange 1 0
2 banana,banana,pear 0 2
3 apple,banana,orange 1 1
If you are not looking for the count of times a given string occurs, but just only a true/false (0/1) of whether the string occurs or not, you can use this slightly modified code to get that result:
df <- data.frame(Fruit = c('apple,orange,orange', 'banana,banana,pear', 'apple,banana,orange'), stringsAsFactors = FALSE)
df$appleCount <- lapply(strsplit(df$Fruit, ','), function(x) 'apple' %in% x)
df$bananaCount <- lapply(strsplit(df$Fruit, ','), function(x) 'banana' %in% x)
The output will be then as follows:
Fruit appleCount bananaCount
1 apple,orange,orange TRUE FALSE
2 banana,banana,pear FALSE TRUE
3 apple,banana,orange TRUE TRUE
If you really want 0/1, you can use as.integer
to convert a logical column to integer value.
Upvotes: 2