rane

Reputation: 931

How to efficiently clean linebreak in entire dataset R data.table

A line break is the usual first target to clean, because it splits a value onto a new row when you "import text file" in Excel, or when another application imports the exported CSV file. (The same approach should also work for cleaning other special characters in a dataset.)

Goal: replace every line break in the entire dataset with a space.

dt[,lapply(.SD,gsub("\\n","",.SD))]

Problems

R froze after running the script on 50+ columns and 3+ million rows.

What's wrong with the lapply approach above? And what is the preferred approach for cleaning something like this across an entire table?

Upvotes: 1

Views: 397

Answers (1)

MichaelChirico

Reputation: 34703

chinsoon12's comment is basically it -- use set for a low-overhead by-reference column overwrite; just add fixed = TRUE so gsub does a plain fixed-string match rather than a regex, which is faster too:

# overwrite each column in place, removing the literal newline character
for (jj in seq_len(ncol(dt))) set(dt, , jj, gsub('\n', '', dt[[jj]], fixed = TRUE))
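
If dt mixes column types, a small variation (just a sketch, assuming the stated goal of turning line breaks into spaces) restricts the loop to character columns, so gsub never runs on, or coerces, numeric columns:

library(data.table)

# Sketch: limit the by-reference overwrite to character columns and
# replace each newline with a space.
char_cols <- which(vapply(dt, is.character, logical(1)))
for (jj in char_cols) {
  set(dt, j = jj, value = gsub('\n', ' ', dt[[jj]], fixed = TRUE))
}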

BTW, \\n is different from \n. \n is the literal newline character, while \\n is the two-character string "\n", i.e., a backslash followed by an n. You can see the difference thus:

cat('hey\nyou')
# hey
# you
cat('hey\\nyou')
# hey\nyou
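
And to tie it back to gsub on a made-up string (just a quick sketch): as a regex pattern, '\\n' is read by the engine as the escape \n, so it still matches a newline, while fixed = TRUE matching skips the regex machinery entirely:

x <- 'hey\nyou'
gsub('\n', ' ', x, fixed = TRUE)   # fixed match on the literal newline
# [1] "hey you"
gsub('\\n', ' ', x)                # regex: the engine reads the \n escape, so it also matches
# [1] "hey you"
gsub('\\n', ' ', x, fixed = TRUE)  # literal backslash + n: nothing to match here
# [1] "hey\nyou"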

Upvotes: 2
