skumar

Reputation: 353

Regular expression in R - remove everything after last symbol

Using the column RelatedToTxt below, I want to create two new columns, Coverage_Type and Name.

If I can extract the content before and after the last "-" sign, I think I should be good. However, in the last case there is also a "-" sign inside the name itself, i.e. between Mayur and Cook.

My question is twofold: first, how should I extract the content before and after the last "-" sign? Second, how should I extract the content correctly when the name itself contains a dash, as in the case above?

RelatedToTxt                        Coverage_Type           Name
Collision - NAWADA REALTY, INC      Collision               NAWADA REALTY, INC
Collision - Don Cooks               Collision               Don Cooks
Pro Dam - Veh - Spl Lt - Raj Perk   Pro Dam - Veh - Spl Lt  Raj Perk
Rental Reimbursement - Mayur-Cook   Rental Reimbursement    Mayur-Cook

Example data:

RelatedToTxt <- c("Collision - NAWADA REALTY, INC", "Collision - Don Cooks",
    "Pro Dam - Veh - Spl Lt - Raj Perk", "Rental Reimbursement - Mayur-Cook")

Upvotes: 0

Views: 904

Answers (1)

Jota

Reputation: 17611

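Try using strsplit to split each string into two pieces. You can split on the final " - " using this regex: .+\\K\\s-\\s. The greedy pattern .+ matches as much as it can, and \\K then drops what has been matched so far, so only the space-hyphen-space that follows is treated as the separator. The greediness of .+ lets it skip over the earlier hyphens in "Pro Dam - Veh - Spl Lt". Because the separator requires whitespace on both sides of the hyphen, the hyphen in "Mayur-Cook" is left alone. \\K is a PCRE feature, so perl = TRUE is required.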

strsplit(RelatedToTxt, ".+\\K\\s-\\s", perl = TRUE)

#[[1]]
#[1] "Collision"          "NAWADA REALTY, INC"
#
#[[2]]
#[1] "Collision" "Don Cooks"
#
#[[3]]
#[1] "Pro Dam - Veh - Spl Lt" "Raj Perk"
#
#[[4]]
#[1] "Rental Reimbursement" "Mayur-Cook"
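
For comparison, the same greedy idea can be sketched with sub() instead of strsplit. This is an illustrative alternative rather than part of the approach above, and it assumes every string contains at least one " - " separator; the variable names Coverage_Type and Name are taken from the question.

```r
RelatedToTxt <- c("Collision - NAWADA REALTY, INC", "Collision - Don Cooks",
    "Pro Dam - Veh - Spl Lt - Raj Perk", "Rental Reimbursement - Mayur-Cook")

# Greedy (.*) captures everything up to the last " - ";
# the rest of the string is discarded.
Coverage_Type <- sub("^(.*) - .*$", "\\1", RelatedToTxt)

# Greedy .* consumes everything through the last " - ",
# leaving only the text after it.
Name <- sub("^.* - ", "", RelatedToTxt)

Coverage_Type
Name
```

A string without " - " would be returned unchanged by both calls, so such rows would need separate handling.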

The strsplit result can be turned into a two-column matrix with

do.call(rbind, strsplit(RelatedToTxt, ".+\\K\\s-\\s", perl = TRUE))
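
To get the named columns asked for in the question, the matrix can be wrapped in a data frame. The following is a minimal sketch, assuming every string contains at least one " - " so that each split yields exactly two pieces; the names parts and result are only illustrative.

```r
RelatedToTxt <- c("Collision - NAWADA REALTY, INC", "Collision - Don Cooks",
    "Pro Dam - Veh - Spl Lt - Raj Perk", "Rental Reimbursement - Mayur-Cook")

# Split each string on the final " - " and stack the pieces row-wise
parts <- do.call(rbind, strsplit(RelatedToTxt, ".+\\K\\s-\\s", perl = TRUE))

# Assemble a data frame using the column names from the question
result <- data.frame(
    RelatedToTxt  = RelatedToTxt,
    Coverage_Type = parts[, 1],
    Name          = parts[, 2],
    stringsAsFactors = FALSE
)
result
```

If some strings have no " - ", strsplit returns a single piece for them and rbind recycles it across both columns, so it is worth checking lengths(strsplit(...)) before relying on the result.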

Upvotes: 1
