Reputation: 8379
How can I read a .docx file using F#. If I use
System.IO.File.ReadAllText("D:/test.docx")
It is returning me some garbage output with beep sounds.
Upvotes: 1
Views: 1103
Reputation: 1896
Try using the OpenXML SDK from Microsoft.
Also on the linked page is the Microsoft tool that you can use to decompile the office 2007 files. The decompiled code can be quite lengthy even for simple documents though so be warned. There is a big learning curve associated with OpenXML SDK. I'm finding it quite difficult to use.
Upvotes: 1
Reputation: 10350
Here is a F# snippet that may give you a jump-start. It successfully extracts all text contents of a Word2010-created .docx
file as a string of concatenated lines:
open System
open System.IO
open System.IO.Packaging
open System.Xml
let getDocxContent (path: string) =
use package = Package.Open(path, FileMode.Open)
let stream = package.GetPart(new Uri("/word/document.xml", UriKind.Relative)).GetStream()
stream.Seek(0L, SeekOrigin.Begin) |> ignore
let xmlDoc = new XmlDocument()
xmlDoc.Load(stream)
xmlDoc.DocumentElement.InnerText
printfn "%s" (getDocxContent @"..\..\test.docx")
In order to make it working do not forget to reference WindowsBase.dll
in your VS project.
Upvotes: 3
Reputation: 19117
System.IO.File.ReadAllText
has type of string -> string
.
Because a .docx file is a binary file, it's probable that some of the chars in the strings have the bell character. Rather than ReadAllText
, look into Word automation, the Packaging, or the OpenXML APIs
Upvotes: 1
Reputation: 138960
.docx files follow Open Packaging Convention specifications. At the lowest level, they are .ZIP files. To read it programmatically, see example here:
A New Standard For Packaging Your Data
Using F#, it's the same story, you'll have to use classes in the System.IO.Packaging Namespace.
Upvotes: 1