Reputation: 505
I have a dataframe with several columns, and one of those columns is populated by pipes "|" and information that I am trying to obtain.
For example:
View(Table$Column)
"|1||KK|12|Gold||4K|"
"|1||Rst|E|Silver||13||"
"|1||RST|E|Silver||18||"
"|1||KK|Y|Iron|y|12||"
"|1||||Copper|Cpr|||E"
"|1||||Iron|||12|F"
And so on for about 120K rows. What I am trying to excavate is everything in between the 5th pipe and the 6th pipe in this series, but in it's own column vector, so the end result looks like this:
View(Extracted)
Gold
Silver
Silver
Iron
Copper
Iron
I don't want to use RegEx. My tools are only limited to R here. Would you guys happen to have any advice how to overcome this?
Thank you.
Upvotes: 0
Views: 1318
Reputation: 270288
1) Assuming x as defined reproducibly in the Note at the end use read.table
as shown. No regular expressions or packages are used.
read.table(text = Table$Column, sep = "|", header = FALSE,
as.is = TRUE, fill = TRUE)[6]
giving:
V6
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
2) This alternative does use a regular expression (which the question asked not to) but just in case here is a tidyr solution. Note that it requires tidyr 0.8.2 or later since earlier versions of tidyr did not support NA
in the into=
argument.
library(dplyr)
library(tidyr)
Table %>%
separate(Column, into = c(rep(NA, 5), "commodity"), sep = "\\|", extra = "drop")
giving:
commodity
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
3) This is another base solution. It is probably not the one you want given that (1) is so much simpler but I wanted to see if we could come up with a second approach in base that did not use regexes. Note that if the split=
argument of strsplit
is ""
then it is treated specially and so is not a regex. It creates a list each of whose components is a vector of single characters. Each such vector is passed to the anonymous function which labels |
and the characters in the field after it with its ordinal number. We then take the characters corresponding to 5 (except the first as it is |
) and collapse them together using paste
.
data.frame(commodities = sapply(strsplit(Table$Column, ""), function(chars) {
wx <- which(cumsum(chars == "|") == 5)
paste(chars[seq(wx[2], tail(wx, 1))], collapse = "")
}), stringsAsFactors = FALSE)
giving:
commodities
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
Table <- data.frame(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||",
"|1||KK|Y|Iron|y|12||",
"|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F"), stringsAsFactors = FALSE)
Upvotes: 3
Reputation: 887911
1) We can use strsplit
from base R
on the delimiter |
and extract the 6th element from the list
of vector
s
sapply(strsplit(Table$Column, "|", fixed = TRUE), `[`, 6)
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
2) Or using regex
(again from base R
), use sub
to extract the 6th word
sub("^([|][^|]+){4}[|]([^|]*).*", "\\2",
gsub("(?<=[|])(?=[|])", "and", Table$Column, perl = TRUE))
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
Table <- structure(list(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||", "|1||KK|Y|Iron|y|12||", "|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F")), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 1
Reputation: 1573
You can try this:
df <- data.frame(x = c("|1||KK|12|Gold||4K|", "|1||Rst|E|Silver||13||"), stringsAsFactors = FALSE)
library(stringr)
stringr::str_split(df$x, "\\|", simplify = TRUE)[, 6]
Upvotes: 2