Exception

Reputation: 8379

How to read .docx file using F#

How can I read a .docx file using F#? If I use

System.IO.File.ReadAllText("D:/test.docx")

it returns garbage output along with beep sounds.

Upvotes: 1

Views: 1103

Answers (4)

Frank Hale

Reputation: 1896

Try using the OpenXML SDK from Microsoft.

The linked page also has the Microsoft tool that you can use to decompile Office 2007 files. Be warned that the decompiled code can be quite lengthy, even for simple documents. The OpenXML SDK has a steep learning curve; I'm finding it quite difficult to use.

Upvotes: 1

Gene Belitski

Reputation: 10350

Here is an F# snippet that may give you a jump-start. It successfully extracts all text content of a .docx file created with Word 2010 as a string of concatenated lines:

open System
open System.IO
open System.IO.Packaging
open System.Xml

let getDocxContent (path: string) =
    // Open the package read-only; reading does not need write access
    use package = Package.Open(path, FileMode.Open, FileAccess.Read)
    // The main document body is stored in the /word/document.xml part
    let part = package.GetPart(Uri("/word/document.xml", UriKind.Relative))
    use stream = part.GetStream()
    let xmlDoc = XmlDocument()
    xmlDoc.Load(stream)
    // InnerText concatenates every text node under the root element
    xmlDoc.DocumentElement.InnerText

printfn "%s" (getDocxContent @"..\..\test.docx")

To make it work, do not forget to reference WindowsBase.dll in your VS project.

Upvotes: 3

Scott Weinstein

Reputation: 19117

System.IO.File.ReadAllText has the type string -> string.

Because a .docx file is a binary file, some of the characters in the resulting string are probably the bell character, which would explain the beeps. Rather than ReadAllText, look into Word automation, the Packaging API, or the OpenXML APIs.
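
To check this on your own file, a minimal sketch along these lines may help. The helper names `looksLikeZip` and `countBellBytes` are made up for illustration; the only assumptions are that a .docx starts with the usual ZIP signature bytes and that the beeps come from BEL bytes (0x07) being echoed to the console.

```fsharp
open System.IO

/// True when the bytes start with the ZIP local-file-header signature "PK\x03\x04".
/// (Hypothetical helper, not part of any library.)
let looksLikeZip (bytes: byte[]) =
    bytes.Length >= 4
    && bytes.[0] = 0x50uy
    && bytes.[1] = 0x4Buy
    && bytes.[2] = 0x03uy
    && bytes.[3] = 0x04uy

/// Counts BEL bytes (0x07), which a console typically renders as a beep.
/// (Hypothetical helper, not part of any library.)
let countBellBytes (bytes: byte[]) =
    bytes |> Array.filter (fun b -> b = 0x07uy) |> Array.length

/// Reads the raw bytes of a file and reports both checks.
let inspect (path: string) =
    let bytes = File.ReadAllBytes path
    looksLikeZip bytes, countBellBytes bytes
```

Calling `inspect "D:/test.docx"` should return a pair: whether the file looks like a ZIP archive, and how many bell bytes it contains.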

Upvotes: 1

Simon Mourier

Reputation: 138960

.docx files follow the Open Packaging Conventions specification. At the lowest level, they are ZIP files. To read one programmatically, see the examples here:

A New Standard For Packaging Your Data

Packages and Parts

The story is the same in F#: you'll have to use the classes in the System.IO.Packaging namespace.
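
To illustrate the "they are ZIP files" point, here is a rough sketch that skips System.IO.Packaging and opens the file with System.IO.Compression.ZipArchive instead. This is a swapped-in alternative, not the Packaging approach described above. The function name `readDocxText` is made up, and the code assumes the main body lives in the `word/document.xml` entry, which is the usual location.

```fsharp
open System.IO
open System.IO.Compression
open System.Xml

/// Reads the text of the main document part from a .docx supplied as a stream.
/// (Hypothetical helper; assumes the body is stored at word/document.xml.)
let readDocxText (docx: Stream) : string =
    // leaveOpen = true: the caller stays responsible for disposing the stream
    use archive = new ZipArchive(docx, ZipArchiveMode.Read, true)
    match archive.GetEntry("word/document.xml") with
    | null -> failwith "word/document.xml not found; is this a Word document?"
    | entry ->
        use part = entry.Open()
        let xml = XmlDocument()
        xml.Load(part)
        // InnerText concatenates every text node under the root element
        xml.DocumentElement.InnerText
```

For a file on disk, opening it with `File.OpenRead "D:/test.docx"` and passing the resulting stream to `readDocxText` should be enough. The Packaging classes remain the better fit if you need relationships and content types rather than just the raw XML.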

Upvotes: 1
