SyncMaster

Reputation: 9956

How to get rid of special characters at the beginning, while using File.ReadAllLines in C#

I tried string[] file = File.ReadAllLines(file_name) to read a Word file.

In debug mode I found that the first few elements of the string array file have values like

"��ࡱ�0\0\0\0>\0\0��\t\0\0\0\0\0". How can I get rid of these?

In some files the first three elements of file[] contain these unreadable characters, while in a few files only the first element does.

What is the problem, and how can I get rid of these characters? My Word file does not even have a blank line at the beginning.

Upvotes: 0

Views: 538

Answers (4)

Oded

Reputation: 499182

Word files are not simple text files, so they contain additional binary information alongside the text.

If you want to extract the text properly, you should use a library that reads Word documents instead of File.ReadAllLines.

Here are a couple of such libraries.

Upvotes: 1

Darin Dimitrov

Reputation: 1039238

File.ReadAllLines is intended for text files. Word files are not text files. To read Word files you might need a library.
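One way to tell the two kinds of file apart before calling File.ReadAllLines is to look at the first few bytes. The following is a minimal sketch, not a definitive check: the class and method names (WordFileSniffer, LooksLikeWordContainer) are made up for illustration. It relies on two well-known signatures: legacy .doc files are OLE2 compound files that begin with D0 CF 11 E0 A1 B1 1A E1 (which is what shows up as the unreadable characters in the question), and .docx files are ZIP archives that begin with "PK". Any other ZIP or OLE2 file matches too, so a positive result only means "this is not plain text".

```csharp
using System;
using System.IO;

static class WordFileSniffer
{
    // Signature of an OLE2 compound file, the container used by legacy .doc files.
    static readonly byte[] Ole2Magic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Signature of a ZIP local file header, the container used by .docx files.
    static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // True if the first 'count' bytes of 'data' begin with 'magic'.
    static bool StartsWith(byte[] data, int count, byte[] magic)
    {
        if (count < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Hypothetical helper: reads up to 8 bytes and compares them with both signatures.
    public static bool LooksLikeWordContainer(string path)
    {
        byte[] header = new byte[8];
        int total = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            // Read may return fewer bytes than requested, so loop until full or end of file.
            int n;
            while (total < header.Length &&
                   (n = fs.Read(header, total, header.Length - total)) > 0)
            {
                total += n;
            }
        }
        return StartsWith(header, total, Ole2Magic) || StartsWith(header, total, ZipMagic);
    }
}
```

If LooksLikeWordContainer(file_name) returns true, the file should go to a Word-aware library rather than to File.ReadAllLines.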

Upvotes: 2

Yuriy Faktorovich

Reputation: 68707

The problem is you're not opening the file with the correct encoding. Here is a guide to opening and creating Word documents from C#.

Upvotes: 3

Kane

Reputation: 16812

If you are using .NET 3.5 then I'd suggest that you use a LINQ where clause to return only the lines that you're interested in.

string[] file = File.ReadAllLines(file_name).Where(line => !line.StartsWith("��")).ToArray();

You could also use some form of regular expression instead of the line.StartsWith() method.

Note: If you are reading Microsoft Office Word files, I'd recommend using COM Interop or a third-party library to read the document. You'll find it much easier than trying to parse the file yourself.
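As a rough illustration of what such a library does for the newer .docx format, here is a minimal sketch that uses only framework classes. It assumes .NET 4.5 or later (for System.IO.Compression), assumes the main document part sits at the conventional path word/document.xml, and uses made-up names (DocxText, ReadParagraphs). It does not handle the legacy binary .doc format shown in the question, and it ignores headers, footers, tabs and line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

static class DocxText
{
    // WordprocessingML main namespace, used by <w:p> (paragraph) and <w:t> (text run).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the text of each paragraph in a .docx stream.
    public static List<string> ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            // Assumption: the main part is stored at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found; not a .docx file?");
            }

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);

                // Join the text runs of each paragraph into one string per paragraph.
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }
}
```

Called as DocxText.ReadParagraphs(File.OpenRead(file_name)), this gives roughly the per-line result the question was after, without the binary noise. For .doc files, or for anything beyond plain paragraph text, a dedicated library or COM Interop remains the safer route.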

Upvotes: 1
