Reputation: 585
In my data, there is a column like this:
df <- data.frame(status = c("GET/sfuksd1567","GET/sjsh787","POST/hsfhuks","GET/sfukfiezd17","POST/fshks"), stringsAsFactors = FALSE)
I want to automatically create another column that indicates the request type in status, extracting only the "GET" or "POST" part, like df$ind=c("GET","GET","POST","GET","POST").
I've tried the function substr, but I didn't succeed.
Original data:
> df
           status
1  GET/sfuksd1567
2     GET/sjsh787
3    POST/hsfhuks
4 GET/sfukfiezd17
5      POST/fshks
Expected result:
> df
           status  ind
1  GET/sfuksd1567  GET
2     GET/sjsh787  GET
3    POST/hsfhuks POST
4 GET/sfukfiezd17  GET
5      POST/fshks POST
Upvotes: 2
Views: 4242
Reputation: 886938
We can use stri_extract_first_words from library(stringi):
library(stringi)
stri_extract_first_words(df$status)
#[1] "GET" "GET" "POST" "GET" "POST"
Another option is extract from tidyr:
library(tidyr)
extract(df, status, into='ind', '([^/]+)/.*', remove=FALSE)
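The pattern passed to extract relies on a capture group: ([^/]+) matches the run of characters before the first slash. As a rough base-R illustration of the same idea (the status vector here is just sample data copied from the question), sub with a backreference should return the same strings:

```r
# Sample values taken from the question
status <- c("GET/sfuksd1567", "GET/sjsh787", "POST/hsfhuks")

# ([^/]+) captures everything before the first "/";
# "\\1" replaces the whole match with just that captured group
ind <- sub("([^/]+)/.*", "\\1", status)
ind
#[1] "GET"  "GET"  "POST"
```

A string without any slash does not match the pattern, so sub would return it unchanged.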
Benchmarks on 1e6 rows, comparing stri_extract_first_words with the other answers:
david <- function() sub('/.*', '', df$status)
etienne <- function() sapply(strsplit(df$status,'/'),`[[`,1)
akrun <- function()stri_extract_first_words(df$status)
df <- df[sample(1:nrow(df), 1e6, replace=TRUE),, drop=FALSE]
library(microbenchmark)
microbenchmark(david(), etienne(), akrun(), unit='relative', times=20L)
#Unit: relative
#      expr      min       lq     mean   median       uq      max neval
#   david() 1.826192 1.824263 1.781562 1.814156 1.788085 1.699008    20
# etienne() 4.935629 5.159218 5.136180 5.198875 5.137107 5.930806    20
#   akrun() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20
NOTE: There are other options in @David Arenburg's post. I am guessing the sub version would be faster, but I could be wrong.
Upvotes: 2
Reputation: 92282
You could simply remove everything from the first forward slash onward using a regex:
df$ind <- sub("/.*", "", df$status)
df
#            status  ind
# 1  GET/sfuksd1567  GET
# 2     GET/sjsh787  GET
# 3    POST/hsfhuks POST
# 4 GET/sfukfiezd17  GET
# 5      POST/fshks POST
Or if you don't like regex, you could try
library(tidyr)
separate(df, "status", c("ind", "status"))
Or
library(data.table) ## V1.9.6+
setDT(df)[, tstrsplit(status, "/")]
Or
read.table(text = df$status, sep = "/")
The last three options just split the status column into two separate columns.
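Since the question mentions substr, here is a minimal base-R sketch of how that route could work. It assumes every value contains a "/" (the variable name slash_pos is only illustrative): regexpr locates the first slash and substr keeps the characters before it.

```r
df <- data.frame(status = c("GET/sfuksd1567", "GET/sjsh787", "POST/hsfhuks",
                            "GET/sfukfiezd17", "POST/fshks"),
                 stringsAsFactors = FALSE)

# Position of the first "/" in each string; fixed = TRUE treats "/" literally
slash_pos <- regexpr("/", df$status, fixed = TRUE)

# Keep the characters from position 1 up to just before the slash
df$ind <- substr(df$status, 1, slash_pos - 1)
df$ind
#[1] "GET"  "GET"  "POST" "GET"  "POST"
```

Unlike the splitting options, this keeps status intact and only adds ind. For a value with no slash, regexpr returns -1, so substr would yield an empty string for that row.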
Upvotes: 11
Reputation: 3678
We have:
df <- data.frame(status = c("GET/sfuksd1567","GET/sjsh787","POST/hsfhuks","GET/sfukfiezd17","POST/fshks"), stringsAsFactors = FALSE)
You can do:
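# split one element at a time instead of re-splitting the whole column on every iteration
df$ind <- sapply(seq_len(nrow(df)), function(x) strsplit(df$status[x], '/')[[1]][1])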
or
df$ind<-sapply(strsplit(df$status,'/'),`[[`,1)
Both return
df
           status  ind
1  GET/sfuksd1567  GET
2     GET/sjsh787  GET
3    POST/hsfhuks POST
4 GET/sfukfiezd17  GET
5      POST/fshks POST
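As a possible variation on the strsplit approach, vapply can be used in place of sapply so that the result type is declared up front. This is only a sketch, not part of the original benchmark; the intermediate name parts is illustrative.

```r
df <- data.frame(status = c("GET/sfuksd1567", "GET/sjsh787", "POST/hsfhuks",
                            "GET/sfukfiezd17", "POST/fshks"),
                 stringsAsFactors = FALSE)

# Split each string on a literal "/" (fixed = TRUE skips regex matching)
parts <- strsplit(df$status, "/", fixed = TRUE)

# Take the first piece of each split; character(1) tells vapply to expect
# exactly one string per element and to fail otherwise
df$ind <- vapply(parts, function(p) p[1], character(1))
df$ind
#[1] "GET"  "GET"  "POST" "GET"  "POST"
```

The declared return type means an unexpected shape raises an error instead of silently producing a list.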
Benchmark:
library(microbenchmark)
microbenchmark(david=sub("/.*", "", df$status),etienne=sapply(strsplit(df$status,'/'),`[[`,1))
Unit: microseconds
    expr    min      lq     mean  median     uq     max neval cld
   david 25.198 25.8985 27.64456 26.5980 27.298 116.189   100  a
 etienne 62.294 63.3440 65.13979 63.8695 65.094 128.088   100   b
Upvotes: 3