Reputation: 5539
I am trying to replace non-printable characters (i.e. extended ASCII characters) in a HUGE string.
foreach (string line in File.ReadLines(txtfileName.Text))
{
    MessageBox.Show(Regex.Replace(line,
        @"\p{Cc}",
        a => string.Format("[{0:X2}]", " ")
    ));
}
This doesn't seem to be working.
Example: AAÂAA should be converted to AA AA
Upvotes: 3
Views: 2258
Reputation: 22813
Assuming the encoding is UTF-8, try this:
string strReplacedVal = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(" "),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(line)
    )
);
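As a rough sketch of the same idea, the round trip through UTF-8 bytes can probably be skipped by encoding the string directly with an ASCII encoding that has a replacement fallback. The class and method names below are hypothetical, made up for illustration. Note that plain ASCII conversion leaves control characters such as U+0001 untouched, so a second hypothetical helper uses a regex over the printable range U+0020 to U+007E in case those should go as well.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class AsciiCleaner
{
    // Replaces every character that ASCII cannot encode with one space.
    public static string ReplaceNonAscii(string line)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(" "),
            new DecoderExceptionFallback());

        // Encoding applies the fallback; decoding the resulting pure-ASCII bytes is lossless.
        return ascii.GetString(ascii.GetBytes(line));
    }

    // Replaces everything outside the printable ASCII range (space through '~'),
    // which also covers control characters and DEL.
    public static string ReplaceNonPrintable(string line)
    {
        return Regex.Replace(line, @"[^\x20-\x7E]", " ");
    }
}
```

With the sample from the question, AsciiCleaner.ReplaceNonAscii("AA\u00C2AA") should give "AA AA", i.e. one space for the single character Â.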
Upvotes: 1
Reputation: 20772
Since File.ReadLines decodes the file as UTF-8 by default, the file is presumably UTF-8. UTF-8 code units are one byte each, and the encoding has the very nice property that characters above ␡ (U+007F) are encoded with bytes exclusively above 0x7f, while characters at or below ␡ are encoded as single bytes at or below 0x7f.
For efficiency, you can rewrite the file in place a few KB at a time.
Note that a character that UTF-8 encodes as several bytes will be replaced by several spaces, one per byte.
// Operates on a UTF-8 encoded text file and overwrites it in place
using (var stream = File.Open(path, FileMode.Open, FileAccess.ReadWrite))
{
    const int size = 4096;
    var buffer = new byte[size];
    int count;
    while ((count = stream.Read(buffer, 0, size)) > 0)
    {
        var changed = false;
        for (int i = 0; i < count; i++)
        {
            // obliterate all bytes that are not encoded characters between ␠ and ␡
            // (bytes below 0x20 include tab, CR and LF, so line breaks become spaces too)
            if (buffer[i] < ' ' || buffer[i] > '\x7f')
            {
                buffer[i] = (byte)' ';
                changed = true;
            }
        }
        if (changed)
        {
            // step back over the chunk just read and overwrite it;
            // the write leaves the position at the start of the next chunk
            stream.Seek(-count, SeekOrigin.Current);
            stream.Write(buffer, 0, count);
        }
    }
}
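To see why one character can turn into more than one space, here is a small in-memory sketch that applies the same per-byte rule to a string instead of a file. The class and method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original answer.
public static class Utf8ByteDemo
{
    // Encodes the text as UTF-8, blanks every byte below 0x20 or above 0x7f,
    // and decodes the remaining (now pure ASCII) bytes.
    public static string BlankOutBytes(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] < 0x20 || bytes[i] > 0x7f)
            {
                bytes[i] = (byte)' ';
            }
        }
        return Encoding.ASCII.GetString(bytes);
    }
}
```

Â (U+00C2) is encoded in UTF-8 as the two bytes 0xC3 0x82, so Utf8ByteDemo.BlankOutBytes("AA\u00C2AA") yields "AA  AA" with two spaces, and a three-byte character such as € (U+20AC) becomes three spaces.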
Upvotes: 0