antecessor

Reputation: 2800

Removing a text string and characters from a column in a dataframe in R

I acknowledge this has been asked in different ways in the past. However, I get lost with gsub.

I have this dataframe:

df <- structure(list(Real = c(7.76, 5.55, 4.8, 4.68, 7.43, 4.59), Predicted = c(7.36, 
5.28, 5.12, 4.47, 7.48, 4.69), PdivR = c(0.95, 0.95, 1.07, 0.96, 
1.01, 1.02), Regression = c("`TLC`~`7_A`.152534", "`TLC`~`7_A`.158324", 
"`TLC`~`7_A`.611461", "`TLC`~`7_A`.627267", "`TLC`~`7_A`.674564", 
"`TLC`~`7_A`.675169")), row.names = c(NA, 6L), class = "data.frame")

Which can be displayed in this way:

head(df)
  Real Predicted PdivR         Regression
1 7.76      7.36  0.95 `TLC`~`7_A`.152534
2 5.55      5.28  0.95 `TLC`~`7_A`.158324
3 4.80      5.12  1.07 `TLC`~`7_A`.611461
4 4.68      4.47  0.96 `TLC`~`7_A`.627267
5 7.43      7.48  1.01 `TLC`~`7_A`.674564
6 4.59      4.69  1.02 `TLC`~`7_A`.675169

In the Regression column, I would like to remove the period (.) and the digits to the right of it, and also the backquote symbol (`), in order to keep only TLC ~ 7_A.

Note that the number of digits after the period varies across rows, but the pattern is always the same.

How could I do it with gsub?

Upvotes: 0

Views: 897

Answers (1)

akrun

Reputation: 886998

We can match the literal period (\\., escaped because an unescaped . is a metacharacter that matches any character) followed by one or more digits (\\d+) up to the end ($) of the string, and replace that match with a blank (""). Then wrap the result in gsub to match every backquote ("`") and remove it:

df$Regression <- gsub("`", "", sub("\\.\\d+$", '', df$Regression))
df$Regression
[1] "TLC~7_A" "TLC~7_A" "TLC~7_A" "TLC~7_A" "TLC~7_A" "TLC~7_A"
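Since the question asks specifically about gsub, the two steps could probably also be folded into a single call by joining both patterns with alternation (|). This is only a sketch of that variant, not part of the original answer; the vector x and the names cleaned and spaced are illustrative, with values copied from the question's Regression column. The last step, which adds spaces around the tilde, assumes the "TLC ~ 7_A" spacing written in the question is actually wanted.

```r
# Sample values copied from the question's Regression column (x is an illustrative name)
x <- c("`TLC`~`7_A`.152534", "`TLC`~`7_A`.675169")

# Single pass: remove every backquote OR a trailing run of ".digits"
cleaned <- gsub("`|\\.\\d+$", "", x)
cleaned

# Optional (assumption): put spaces around the tilde, as written in the question
spaced <- gsub("~", " ~ ", cleaned, fixed = TRUE)
spaced
```

To apply it to the data frame, the same pattern would be used on df$Regression in place of x. The nested sub/gsub version keeps the two removals separate, which may be easier to read; the alternation version just saves one call.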

Upvotes: 1
