Reputation: 505

Splitting a column in a data frame by an nth instance of a character

I have a dataframe with several columns, and one of those columns is populated by pipes "|" and information that I am trying to obtain.

For example:

View(Table$Column)
"|1||KK|12|Gold||4K|"
"|1||Rst|E|Silver||13||"
"|1||RST|E|Silver||18||"
"|1||KK|Y|Iron|y|12||"
"|1||||Copper|Cpr|||E"
"|1||||Iron|||12|F"

And so on for about 120K rows. What I am trying to excavate is everything in between the 5th pipe and the 6th pipe in this series, but in it's own column vector, so the end result looks like this:

View(Extracted)
Gold
Silver
Silver
Iron
Copper
Iron

I don't want to use RegEx. My tools are only limited to R here. Would you guys happen to have any advice how to overcome this?

Thank you.

Upvotes: 0

Answers (3)

G. Grothendieck

Reputation: 270288

1) Assuming x as defined reproducibly in the Note at the end use read.table as shown. No regular expressions or packages are used.

read.table(text = Table$Column, sep = "|", header = FALSE, 
  as.is = TRUE, fill = TRUE)[6]

giving:

      V6
1   Gold
2 Silver
3 Silver
4   Iron
5 Copper
6   Iron

2) This alternative does use a regular expression (which the question asked not to) but just in case here is a tidyr solution. Note that it requires tidyr 0.8.2 or later since earlier versions of tidyr did not support NA in the into= argument.

library(dplyr)
library(tidyr)

Table %>% 
  separate(Column, into = c(rep(NA, 5), "commodity"), sep = "\\|", extra = "drop")

giving:

  commodity
1      Gold
2    Silver
3    Silver
4      Iron
5    Copper
6      Iron

3) This is another base solution. It is probably not the one you want given that (1) is so much simpler but I wanted to see if we could come up with a second approach in base that did not use regexes. Note that if the split= argument of strsplit is "" then it is treated specially and so is not a regex. It creates a list each of whose components is a vector of single characters. Each such vector is passed to the anonymous function which labels | and the characters in the field after it with its ordinal number. We then take the characters corresponding to 5 (except the first as it is |) and collapse them together using paste.

data.frame(commodities = sapply(strsplit(Table$Column, ""), function(chars) {
  wx <- which(cumsum(chars == "|") == 5)
  paste(chars[seq(wx[2], tail(wx, 1))], collapse = "")
}), stringsAsFactors = FALSE)

giving:

  commodities
1        Gold
2      Silver
3      Silver
4        Iron
5      Copper
6        Iron

Note

Table <- data.frame(Column = c("|1||KK|12|Gold||4K|",
  "|1||Rst|E|Silver||13||",
  "|1||RST|E|Silver||18||",
  "|1||KK|Y|Iron|y|12||",
  "|1||||Copper|Cpr|||E",
  "|1||||Iron|||12|F"), stringsAsFactors = FALSE)

Upvotes: 3

akrun

Reputation: 887911

1) We can use strsplit from base R on the delimiter | and extract the 6th element from the list of vectors

sapply(strsplit(Table$Column, "|", fixed = TRUE), `[`, 6)
#[1] "Gold"   "Silver" "Silver" "Iron"   "Copper" "Iron"

2) Or using regex (again from base R), use sub to extract the 6th word

sub("^([|][^|]+){4}[|]([^|]*).*", "\\2", 
       gsub("(?<=[|])(?=[|])", "and", Table$Column, perl = TRUE))
#[1] "Gold"   "Silver" "Silver" "Iron"   "Copper" "Iron"

data

Table <- structure(list(Column = c("|1||KK|12|Gold||4K|", 
 "|1||Rst|E|Silver||13||", 
 "|1||RST|E|Silver||18||", "|1||KK|Y|Iron|y|12||", "|1||||Copper|Cpr|||E", 
 "|1||||Iron|||12|F")), class = "data.frame", row.names = c(NA, 
 -6L))

Upvotes: 1

Mislav

Reputation: 1573

You can try this:

df <- data.frame(x = c("|1||KK|12|Gold||4K|", "|1||Rst|E|Silver||13||"), stringsAsFactors = FALSE)
library(stringr)
stringr::str_split(df$x, "\\|", simplify = TRUE)[, 6]

Upvotes: 2

Splitting a column in a data frame by an nth instance of a character

Answers (3)

Note

data

Related Questions