Reputation: 175
I am working with horrible text data (a 2 GB CSV file) that has practically every control character in the range 0x00-0x1F scattered throughout. I tried to read it into R for processing but cannot, because of the EOFs (0x04):
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
So I thought sed would be a good tool for removing all the non-printable junk in the file, but there seems to be something strange about how escape chars are represented in sed syntax. I have tried all of the following, none of which seem to work:
Include only specified chars:
sed 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' IN.csv > OUT.csv
Identify range of non-printable in decimal or hex:
cat IN.csv | sed 's/[\d0-\d31]//g' > OUT.csv
cat IN.csv | sed s/[$'\x00'-$'\x1F']//g OUT.csv
cat IN.csv | sed 's/\x00-\x1F//g' > OUT.csv
and using Ctrl-V Ctrl-D to produce this:
cat IN.csv | sed s/^D//g > OUT.csv
All of the commands appear to execute, but the resulting output still contains the non-printable chars and is altered in unexpected ways.
What I found that DOES WORK is this:
cat IN.csv | sed 's/'`echo -e "\x04"`'//g' > OUT.csv
or this:
cat IN.csv | sed 's/\x04//g' > test3.csv
However, this only works for a single escape char. Is there a better way to remove all of the non-printable chars at once with a single range, without having to run one command per character? I assume I am not entering the range syntax properly.
Upvotes: 0
Views: 471
Reputation: 1628
You could try awk:
awk '{gsub(/[[:cntrl:]]/,"")}1' your_file
or try sed:
sed 's/[[:cntrl:]]//g' orig_file > new_file
or try perl:
perl -pe 's/[\x00-\x08\x0B-\x1F]//g' orig_file > new_file
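For the explicit range the question asks about, a bracket expression is needed around the escapes; without the brackets, a pattern like \x00-\x1F is read as a literal three-character sequence rather than a range. The following is a minimal sketch that assumes GNU sed (other sed implementations may not understand \xHH escapes). The function name strip_ctrl_sed is made up for illustration, and NUL (0x00) is deliberately left out of the range here.

```shell
# Hypothetical helper; assumes GNU sed, which understands \xHH escapes.
# Removes bytes 0x01-0x08, 0x0B-0x1F and 0x7F, so tab (0x09) is kept.
# Newline (0x0A) is the line delimiter and never reaches the pattern space.
strip_ctrl_sed() {
    LC_ALL=C sed 's/[\x01-\x08\x0B-\x1F\x7F]//g'
}

# Sample line with EOT (0x04) and US (0x1F) mixed into CSV-like text;
# od -c makes any remaining control bytes visible.
printf 'a\004b,\tc\037d\n' | strip_ctrl_sed | od -c
```

Setting LC_ALL=C makes the range byte-based, which avoids surprises from multibyte locales.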
Upvotes: 0
Reputation: 27476
For removal (and transliteration) there is a better tool called tr (translate or delete characters). You can remove non-printable characters using:
cat IN.csv | tr -cd '\11\12\15\40-\176' > OUT.csv
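-d deletes the characters mentioned; -c inverts the set, so everything except tab (\11), newline (\12), carriage return (\15) and printable ASCII (\40-\176) is deleted.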
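To try this on a small sample before running it over the whole 2 GB file, something like the following may help. It is only a sketch: the function name strip_nonprintable and the sample string are made up for illustration.

```shell
# Hypothetical helper: keep tab, LF, CR and printable ASCII; delete every other byte.
strip_nonprintable() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Sample line with EOT (0x04), NUL (0x00) and US (0x1F) mixed into the text;
# od -c shows each remaining byte so leftover control characters stand out.
printf 'a\004b,\tc\000d\037e\n' | strip_nonprintable | od -c
```

For the real file the same function can be used as: strip_nonprintable < IN.csv > OUT.csv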
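Or using the POSIX [:print:] class. Note that [:print:] does not include tab, newline or carriage return, so add them explicitly to keep the line structure:
cat IN.csv | tr -cd '[:print:]\t\n\r' > OUT.csv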
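One way to check whether a cleanup worked is to count the control bytes left in the output. This is a rough sketch: count_control_bytes is a made-up name, and it treats tab, LF and CR as acceptable, which is an assumption about what the CSV should keep.

```shell
# Hypothetical helper: print the number of control bytes in the file "$1",
# ignoring tab (011), LF (012) and CR (015).
# tr -cd deletes everything that is NOT in the listed set, so only the
# unwanted control bytes are left to be counted by wc -c.
count_control_bytes() {
    LC_ALL=C tr -cd '\000-\010\013\014\016-\037\177' < "$1" | wc -c | tr -d ' '
}

# Usage: count_control_bytes OUT.csv
```

A result of 0 suggests the file no longer contains the characters that were tripping up R.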
Upvotes: 2