dj_paige

Reputation: 353

Extract "words" from a string

I have a table with 153 rows and 9 columns. I am interested in the character string in the first column: I want to extract the fourth word of each string and build a new list from it. The result should have 153 rows and 1 column.

An example of the first two rows of column 1 of this database table:

[1] Resistance_Test DevID (Ohms) 428
[2] Diode_Test SUBLo (V) 353

"Words" are separated by spaces, so the fourth word of the first row is "428" and the fourth word of the second row is "353". How can I create a new list containing the fourth word of all 153 rows?

Upvotes: 3

Views: 8694

Answers (4)

panman

Reputation: 1341

You could use word() from the stringr package:

> x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
> library(stringr)
> word(string = x, start = 4, end = 4)
[1] "428" "353"

Because start and end are both set to 4, word() returns only the fourth word of each string.

I hope this helps.

Upvotes: 2

Yollanda Beetroot

Reputation: 333

If you are not familiar with regular expressions, the function strsplit can help you:

data <- c('Resistance_Test DevID (Ohms) 428', 'Diode_Test SUBLo (V) 353')
unlist(lapply(strsplit(data, ' '), function(x) x[4]))
[1] "428" "353"
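
Since the question is about the first column of a table, the same idea might be applied to a data frame column as sketched below. The data frame `df` and its column name `test` are hypothetical stand-ins for the real 153-row table, and `vapply()` is swapped in for `unlist(lapply(...))` only so that the result is guaranteed to have one element per row.

```r
# Hypothetical stand-in for the 153 x 9 table; only the first column matters here.
df <- data.frame(
  test = c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353"),
  stringsAsFactors = FALSE
)

# Split each string on a literal space, then take the fourth piece.
parts <- strsplit(df$test, " ", fixed = TRUE)

# vapply() returns a character vector with exactly one element per row;
# a row with fewer than four words yields NA instead of being dropped.
fourth <- vapply(parts, function(p) p[4], character(1))

fourth
# [1] "428" "353"
```

If a one-column table is needed rather than a vector, `data.frame(fourth = fourth)` would wrap the result.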

Upvotes: 1

akrun

Reputation: 887048

We can use sub. The pattern matches one or more non-whitespace characters (\\S+) followed by one or more whitespace characters (\\s+), repeated three times ({3}), then a word captured in a second group ((\\w+)), then any remaining characters (.*). The whole match is replaced by the second backreference (\\2).

sub("(\\S+\\s+){3}(\\w+).*", "\\2", str1)
#[1] "428" "353"

This selects the word by its position, so it still returns the fourth word when more words follow:

 sub("(\\S+\\s+){3}(\\w+).*", "\\2", str2)
 #[1] "428" "353" "428"

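Another option is stri_extract_last_regex from the stringi package. It extracts the last word, which is the fourth word only when each string has exactly four words, as in str1: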

 library(stringi)
 stri_extract_last_regex(str1, "\\w+")
 #[1] "428" "353"

data

str1 <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
str2 <- c(str1, "Resistance_Test DevID (Ohms) 428 something else")
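
To make the difference between "fourth word" and "last word" concrete, here is a small base R sketch on str2. The second `sub()` call is only an assumed stand-in for the stringi last-word extraction, written so the example does not need extra packages.

```r
str1 <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
str2 <- c(str1, "Resistance_Test DevID (Ohms) 428 something else")

# Fourth word: skip three "word + whitespace" groups, keep the next word.
fourth <- sub("(\\S+\\s+){3}(\\w+).*", "\\2", str2)

# Last word (base R stand-in for a last-match extraction):
# keep only the word that follows the final whitespace character.
last <- sub("^.*\\s(\\w+)$", "\\1", str2)

fourth
# [1] "428" "353" "428"
last
# [1] "428"  "353"  "else"
```

The two results agree for the four-word strings and differ for the longer one, so the last-word shortcut is only safe when every row has exactly four words.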

Upvotes: 1

Andrie

Reputation: 179418

Use gsub() with a regular expression:

x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
ptn <- "(.*? ){3}"
gsub(ptn, "", x)

[1] "428" "353"

This works because the regular expression (.*? ){3} matches exactly three ({3}) runs of characters, each followed by a space (.*? ), and gsub() replaces the match with an empty string.

See ?gsub and ?regexp for more information.
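
One thing worth checking is how this pattern behaves when a string has more than four words. The sketch below is an illustration under assumptions, not part of the original answer: it adds `perl = TRUE` so that the lazy quantifier follows PCRE semantics, and the longer test strings are made up.

```r
x <- c("Resistance_Test DevID (Ohms) 428",
       "Resistance_Test DevID (Ohms) 428 something else",
       "a b c d e f g")

# gsub() removes every complete run of three "characters + space" groups,
# so whatever is left over is returned, not necessarily the fourth word.
stripped <- gsub("(.*? ){3}", "", x, perl = TRUE)
stripped
# [1] "428"                "428 something else" "g"

# An anchored variant: skip exactly three words from the start,
# capture the fourth, and discard the rest of the string.
fourth <- sub("^(?:\\S+ ){3}(\\S+).*$", "\\1", x, perl = TRUE)
fourth
# [1] "428" "428" "d"
```

For data where every row has exactly four words, as in the question, both forms give the same result.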


If your data has structure that you don't mention in your question, then possibly the regular expression becomes even easier.

For example, if you are always interested in the last word of each line:

ptn <- "(.*? )"
gsub(ptn, "", x)

Or perhaps you know for sure you can only search for digits and discard everything else:

ptn <- "\\D"
gsub(ptn, "", x)
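
The digits-only shortcut depends on no other word containing a digit. A minimal sketch of that caveat, using a made-up second string ("Test2 ...") that is not from the question, and a trailing-digits pattern as one possible alternative:

```r
x <- c("Resistance_Test DevID (Ohms) 428",
       "Test2 DevID (Ohms) 428")   # hypothetical row with a digit in word 1

# Dropping every non-digit glues all digits in the string together.
all_digits <- gsub("\\D", "", x)
all_digits
# [1] "428"  "2428"

# Keeping only the digits that end the string avoids that problem.
trailing <- sub("^.*\\s(\\d+)$", "\\1", x)

# Convert to numbers if the values are needed for calculations.
values <- as.numeric(trailing)
values
# [1] 428 428
```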

Upvotes: 2
