gogolaygo
gogolaygo

Reputation: 325

Create a new column that counts the number of a sub-string in a string column?

I need to create some new columns that count 1 if a sub-string appears one or more times in a string column. Like this:

Existing Column         New Col (True if apple)    New Col (True if banana)
apple, apple, orange            1                              0
banana, banana, orange          0                              1
apple, banana, orange           1                              1

Upvotes: -1

Views: 1224

Answers (3)

alistaire
alistaire

Reputation: 43334

So I thought you wanted columns of counts (not whether strings are contained) the first time I read the question (the previous edit), but it's sort of useful code anyway, so I left it. Here are options for both base R and the stringr package:

First let's make a sample data.frame with similar data

# stringsAsFactors = FALSE would be smart here, but let's not assume...
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))   

which looks like

> df
           x
1 a, b, c, a
2    b, b, c
3       d, a

Base R

Use strsplit to make a list of vectors of separated strings, using as.character to coerce factors to a useful form,

list <- strsplit(as.character(df$x), ', ')

then make a list of unique strings

lvls <- unique(unlist(list))

Making Contains Columns

Loop over the rows of the data.frame/list with sapply. (All sapply functions in this answer could be replaced with for loops, but that's generally considered poor style in R for speed reasons.) Test if the unique strings are in each, and change to integer format. Set the result (transposed) to a new column of df, one for each unique string.

df[, lvls] <- t(sapply(1:nrow(df), function(z){as.integer(lvls %in% list[[z]])}))

> df
           x a b c d
1 a, b, c, a 1 1 1 0
2    b, b, c 0 1 1 0
3       d, a 1 0 0 1

To keep values as Boolean TRUE/FALSE instead of integers, just remove as.integer.

Making Count Columns

Loop over the rows of the data.frame/list with the outside sapply, while the inner one loops over the unique strings in each, and counts the occurrences by summing TRUE values. Set the result (transposed) to a new column of df, one for each unique string.

df[, lvls] <- t(sapply(1:nrow(df), function(z){
    sapply(seq_along(lvls), function(y){sum(lvls[y] == list[[z]])})
}))

> df
           x a b c d
1 a, b, c, a 2 1 1 0
2    b, b, c 0 2 1 0
3       d, a 1 0 0 1

stringr

stringr can make these tasks much more straightforward.

First, find unique strings in df$x. Split strings with str_split (which can take a factor), flatten them into a vector with unlist, and find unique ones:

library(stringr)
lvls <- unique(unlist(str_split(df$x, ', ')))

Making Contains Columns

str_detect allows us to only loop over the unique strings, not rows:

df[, lvls] <- sapply(lvls, function(y){as.integer(str_detect(df$x, y))})

Making Count Columns

str_count simplifies our syntax dramatically, again only looping over lvls:

df[,lvls] <- sapply(lvls, function(y){str_count(df$x, y)})

Results for both are identical to those in base R above.

Upvotes: 3

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193497

Using "df" from @user3949008's answer, you can also try cSplit_e from my "splitstackshape" package:

library(splitstackshape)
cSplit_e(df, "Fruit", ",", type = "character", fill = 0)
#                 Fruit Fruit_apple Fruit_banana Fruit_orange Fruit_pear
# 1 apple,orange,orange           1            0            1          0
# 2  banana,banana,pear           0            1            0          1
# 3 apple,banana,orange           1            1            1          0

You can always drop the columns you're not interested in later.

If you're after counts, you can try mtabulate from "qdapTools":

library(qdapTools)
mtabulate(strsplit(df$Fruit, ","))
#   apple banana orange pear
# 1     1      0      2    0
# 2     0      2      0    1
# 3     1      1      1    0

Upvotes: 2

Gopala
Gopala

Reputation: 10473

So, without full details, it is very hard to know exactly what you are looking for. But, if you are looking for number of times a given string occurs and adding as a column to the original data, here is one approach that works (replicating your data input):

df <- data.frame(Fruit = c('apple,orange,orange', 'banana,banana,pear', 'apple,banana,orange'), stringsAsFactors = FALSE)

df$appleCount <- lapply(strsplit(df$Fruit, ','), function(x) sum('apple' == x))
df$bananaCount <- lapply(strsplit(df$Fruit, ','), function(x) sum('banana' == x))

This will only work when you know the specific strings you identified to add as columns. But, should give you an idea of how to split strings, count how many of a given are in that split list, etc. Hope this helps.

Output from the above code should be this:

                Fruit appleCount bananaCount
1 apple,orange,orange          1           0
2  banana,banana,pear          0           2
3 apple,banana,orange          1           1

If you are not looking for the count of times a given string occurs, but just only a true/false (0/1) of whether the string occurs or not, you can use this slightly modified code to get that result:

df <- data.frame(Fruit = c('apple,orange,orange', 'banana,banana,pear', 'apple,banana,orange'), stringsAsFactors = FALSE)
df$appleCount <- lapply(strsplit(df$Fruit, ','), function(x) 'apple' %in% x)
df$bananaCount <- lapply(strsplit(df$Fruit, ','), function(x) 'banana' %in% x)

The output will be then as follows:

            Fruit appleCount bananaCount
1 apple,orange,orange       TRUE       FALSE
2  banana,banana,pear      FALSE        TRUE
3 apple,banana,orange       TRUE        TRUE

If you really want 0/1, you can use as.integer to convert a logical column to integer value.

Upvotes: 2

Related Questions