Varun

Reputation: 1321

R Remove Unicode using regular expression

I have a data frame that looks like this:

df <- data.frame(
  ID = c(1, 2, 3),
  hashtag = c(
    'c("#job", "#inclusion<U+0085>", "#driver", "#splitme")',
    'c("#job", "#inclusion<U+0085>", "#driver")',
    'c("#job", "#inclusion<U+0085>")'
  )
)

I'd like to clean this up first, then split the hashtag column into multiple columns based on the number of hashtags in each cell. For example, the first row has 4 hashtags, so it should be split into four columns containing #job, #inclusion, #driver and #splitme.

I tried the following

#Clean up
#Remove inverted commas
df$hashtag <- gsub('"', '', df$hashtag)

#Remove brackets
df$hashtag <-gsub("c\\(|\\)", "", df$hashtag)

#Then split columns (separate() and %>% come from tidyr)
library(tidyr)
df_split <- df %>% separate(hashtag, c("A", "B", "C", "D"), sep = ", ", extra = "drop")

When I try to remove the unicode using the following line of code, nothing happens.

#Remove unicode
df$hashtag <-gsub("\\<|\\>", "", df$hashtag)

Any ideas on what could be the right solution to this?

Upvotes: 1

Views: 838

Answers (1)

CPak

Reputation: 13591

You didn't specify the desired output, but you can follow this approach, starting from the hashtag column after your quote and bracket clean-up:

# vector of hashtag column
v <- df$hashtag

w <- gsub("[#]", "", v)
# [1] "job, inclusion<U+0085>, driver, splitme"
# [2] "job, inclusion<U+0085>, driver"         
# [3] "job, inclusion<U+0085>"

# [^>]+ stops at the first ">", so several <U+....> tokens in one string are removed separately
ans <- gsub("<[^>]+>", "", w)
# [1] "job, inclusion, driver, splitme" "job, inclusion, driver"         
# [3] "job, inclusion"

# split on comma plus space so the pieces carry no leading blank
unlist(strsplit(ans, ", "))
# [1] "job"       "inclusion" "driver"    "splitme"   "job"       "inclusion"
# [7] "driver"    "job"       "inclusion"
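
As for why the original attempt did nothing: in R's regex syntax, `\\<` and `\\>` are word-boundary anchors rather than literal angle brackets, so that pattern never matches the `<` and `>` characters. If the goal is one column per hashtag, as the question describes, the steps can be combined roughly as follows. This is only a sketch in base R: it assumes the `<U+0085>` text is literally present in the strings (as in the sample data), and the column names A to D and the name `df_split` are borrowed from the question rather than required.

```r
# Sample data copied from the question
df <- data.frame(
  ID = c(1, 2, 3),
  hashtag = c(
    'c("#job", "#inclusion<U+0085>", "#driver", "#splitme")',
    'c("#job", "#inclusion<U+0085>", "#driver")',
    'c("#job", "#inclusion<U+0085>")'
  ),
  stringsAsFactors = FALSE
)

# Remove the c( ) wrapper and the double quotes
cleaned <- gsub('c\\(|\\)|"', "", df$hashtag)

# Remove literal <U+XXXX> tokens (assumption: they are plain text, not real characters)
cleaned <- gsub("<U\\+[0-9A-Fa-f]+>", "", cleaned)

# Split each row on a comma followed by optional spaces
parts <- strsplit(cleaned, ", *")

# Pad shorter rows with NA so every row has the same number of pieces
n_max <- max(lengths(parts))
padded <- do.call(rbind, lapply(parts, function(p) {
  length(p) <- n_max
  p
}))
colnames(padded) <- LETTERS[seq_len(n_max)]

# Bind the hashtag columns back onto the ID column
df_split <- cbind(df["ID"], as.data.frame(padded, stringsAsFactors = FALSE))
```

Rows with fewer hashtags than the longest row get NA in the trailing columns, which is also what `separate()` produces when it fills missing pieces.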

Upvotes: 1
