rane

Reputation: 931

How to efficiently clean linebreak in entire dataset R data.table

A line break is the usual first target to clean, because it splits a value onto a new row when you "import text file" in Excel, or when another application imports the exported CSV file. (The same approach should also work for cleaning other special characters in a dataset.)

Goal: replace every line break in the entire dataset with a space.

dt[,lapply(.SD,gsub("\\n","",.SD))]

Problems

R froze after running the script on 50+ columns and 3+ million rows.

What's wrong with the lapply approach above? And what is the preferred approach for cleaning something like this across an entire table?

Upvotes: 1

Views: 397

Answers (1)

MichaelChirico

Reputation: 34703

chinsoon12's comment is basically it -- use set for a low-overhead by-reference column overwrite; just add fixed = TRUE so gsub does a plain fixed-string match rather than a regex, which is faster too:

# overwrite each column in place, removing the literal newline character
for (jj in seq_len(ncol(dt))) set(dt, , jj, gsub('\n', '', dt[[jj]], fixed = TRUE))
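
If dt mixes column types, a small variation (just a sketch, assuming the stated goal of turning line breaks into spaces) restricts the loop to character columns, so gsub never runs on, or coerces, numeric columns:

library(data.table)

# Sketch: limit the by-reference overwrite to character columns and
# replace each newline with a space.
char_cols <- which(vapply(dt, is.character, logical(1)))
for (jj in char_cols) {
  set(dt, j = jj, value = gsub('\n', ' ', dt[[jj]], fixed = TRUE))
}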

BTW, \\n is different from \n. \n is the literal newline character, while \\n is the two-character string "\n", i.e., a backslash followed by an n. You can see the difference thus:

cat('hey\nyou')
# hey
# you
cat('hey\\nyou')
# hey\nyou
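
And to tie it back to gsub on a made-up string (just a quick sketch): as a regex pattern, '\\n' is read by the engine as the escape \n, so it still matches a newline, while fixed = TRUE matching skips the regex machinery entirely:

x <- 'hey\nyou'
gsub('\n', ' ', x, fixed = TRUE)   # fixed match on the literal newline
# [1] "hey you"
gsub('\\n', ' ', x)                # regex: the engine reads the \n escape, so it also matches
# [1] "hey you"
gsub('\\n', ' ', x, fixed = TRUE)  # literal backslash + n: nothing to match here
# [1] "hey\nyou"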

Upvotes: 2
