0bserver07

Reputation: 3471

Text Conversion to Database

Hello creative developers and night rangers of StackOverflow. I have a customer with a dictionary of around 20 thousand words stored in Microsoft Word document files.

He created it about a decade ago, and now I have to load the contents of these *.doc files into a database to build a dictionary for him.

My question is: where do I start when converting text laid out in columns into some sort of database?

I'm thinking about using RegEx and using some patterns. So any cool suggestions?

Upvotes: 0

Views: 204

Answers (2)

Zev Spitz

Reputation: 15327

Sample in C#:

For starters, add a reference to Microsoft.Office.Interop.Word. Then you can do some basic parsing:

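using System.Collections.Generic;
using Microsoft.Office.Interop.Word;

var wdApp = new Application();
var dict = new Dictionary<string, string>();
//paths is some collection of paths to the Word documents
//You can use Directory.EnumerateFiles to get such a collection from a folder
//EnumerateFiles also allows you to filter the files, say to only .doc
foreach (var path in paths) {
    var wdDoc = wdApp.Documents.Open(path);
    foreach (Paragraph p in wdDoc.Paragraphs) {
        //Range.Text ends with the paragraph mark (\r), so trim it off
        var text = p.Range.Text.Trim();
        var delimiterPos = text.IndexOf(";");
        //skip paragraphs that have no delimiter
        if (delimiterPos < 0) { continue; }
        //Add throws on a duplicate key; use dict[key] = value to overwrite instead
        dict.Add(
            text.Substring(0, delimiterPos).Trim(),
            text.Substring(delimiterPos + 1).Trim()
        );
    }
    wdDoc.Close();
}
//quit Word, otherwise the WINWORD.EXE process keeps running
wdApp.Quit();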
//This can be done more cleanly using LINQ, but Dictionary<TKey,TValue> doesn't have an AddRange method.
//OTOH, such a method can be easily added as an extension method, taking IEnumerable<KeyValuePair<TKey,TValue>>
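
As a rough sketch of the extension method mentioned in the comments above: the class name DictionaryExtensions and the sample lines are made up for illustration, and the choice to let later duplicates overwrite earlier ones is an assumption (Dictionary.Add would throw instead).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample lines standing in for paragraph text read from Word.
var lines = new[] { "apple;a fruit", "book;a bound set of pages" };

var dict = new Dictionary<string, string>();
dict.AddRange(
    from line in lines
    let pos = line.IndexOf(';')
    where pos > 0 // skip lines with no delimiter or an empty headword
    select new KeyValuePair<string, string>(
        line.Substring(0, pos).Trim(),
        line.Substring(pos + 1).Trim()));

Console.WriteLine(dict["apple"]);

static class DictionaryExtensions {
    // Copies every pair into the target; a repeated key overwrites the earlier value.
    public static void AddRange<TKey, TValue>(
        this IDictionary<TKey, TValue> target,
        IEnumerable<KeyValuePair<TKey, TValue>> items) {
        foreach (var item in items) {
            target[item.Key] = item.Value;
        }
    }
}
```

With this in place, the paragraph loop can be replaced by a single LINQ query over the paragraphs of each document.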

For more complex parsing, you can save each document as a text file:

var newPaths =
    from path in paths
    select new {
        path,
        //If needed, add some logic to put the textfile in a different folder
        newPath = Path.ChangeExtension(path, ".txt")
    };
var wdApp = new Application();
foreach (var item in newPaths) {
    var wdDoc = wdApp.Documents.Open(item.path);
    wdDoc.SaveAs2(
        FileName: item.newPath,
        FileFormat: WdSaveFormat.wdFormatText
    );
    wdDoc.Close();
}
//quit Word, otherwise the WINWORD.EXE process keeps running
wdApp.Quit();

You may also need to create a file named schema.ini and put it in the same folder as the text files (see Microsoft's documentation on schema.ini for the syntax):

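//assuming the delimiter is a ;
//schemaPath is the full path to schema.ini, in the same folder as the text files
File.WriteAllLines(schemaPath,
    from item in newPaths
    from line in new[] {
        //the section name is the file name, without the folder
        String.Format("[{0}]", Path.GetFileName(item.newPath)),
        "Format=Delimited(;)"
    }
    select line
);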

Then, you can query the resulting text files using SQL statements, via the OleDbConnection, OleDbCommand, and OleDbDataReader classes:

foreach (var item in newPaths) {
    //with the text driver, Data Source is the folder and each file is a table
    var connectionString = String.Format(
        @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=""text;HDR=NO;IMEX=1;""",
        Path.GetDirectoryName(item.newPath));
    using (var conn = new OleDbConnection(connectionString)) {
        //the connection must be open before the command runs
        conn.Open();
        using (var cmd = conn.CreateCommand()) {
            cmd.CommandText = String.Format(@"
                SELECT *
                FROM [{0}]
            ", Path.GetFileName(item.newPath));
            using (var rdr = cmd.ExecuteReader()) {
                //parse file contents here
            }
        }
    }
}

Upvotes: 1

d_inevitable

Reputation: 4451

The main problem here is not that the data is stored as text, but that it sits inside tables in .doc files and is spread across many files.

So what you will need to do is:

  • Combine it into one file.
  • Convert it into SQL text.
  • Convert it into a text file.

You can do this in any order, but the order will change the methodology a lot.

You could write MS Word macros (in VBA) that convert the content into SQL text and combine the documents into one.

Or you could convert the documents to RTF, and then write a script in any language you like to do the rest.

Regular expressions will surely come in handy, but I can't say what they should look like, because you did not specify what the files look like.
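
Purely as an illustration, here is what a pattern could look like if each entry ended up on its own line as "headword;definition". That layout, the semicolon delimiter, and the names entryPattern and ParseEntries are all assumptions, not something taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Assumed layout: "headword;definition", one entry per line.
// Everything before the first ';' is the headword, the rest is the definition.
var entryPattern = new Regex(@"^\s*(?<word>[^;]+?)\s*;\s*(?<definition>.+?)\s*$");

List<(string Word, string Definition)> ParseEntries(IEnumerable<string> lines) {
    var entries = new List<(string Word, string Definition)>();
    foreach (var line in lines) {
        var match = entryPattern.Match(line);
        if (!match.Success) continue; // skip blank or malformed lines
        entries.Add((match.Groups["word"].Value, match.Groups["definition"].Value));
    }
    return entries;
}

// Made-up sample input, including a blank line and a line without a delimiter.
var parsed = ParseEntries(new[] {
    "apple ; a fruit",
    "",
    "no delimiter here",
    "book;a bound set of pages"
});
foreach (var (word, definition) in parsed) {
    Console.WriteLine($"{word} => {definition}");
}
```

Lines that do not match are silently skipped here; for real data it is probably worth logging them so nothing gets lost.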

If there are not that many files, you could consider using copy & paste to put everything into a simple text file. That will get rid of the tables too. The result might be ugly, but it would still be structured enough that it can be converted into SQL.
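
Once the words and definitions are available as pairs, turning them into SQL text could look roughly like this. The table name dictionary, its columns word and definition, and the helpers SqlQuote and ToInsert are hypothetical; doubling single quotes is the standard SQL way to escape them, but check how your database treats other characters such as backslashes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Doubles single quotes so the value can sit inside a SQL string literal.
string SqlQuote(string value) => "'" + value.Replace("'", "''") + "'";

// Hypothetical table and column names; adjust them to the real schema.
string ToInsert(string word, string definition) =>
    $"INSERT INTO dictionary (word, definition) VALUES ({SqlQuote(word)}, {SqlQuote(definition)});";

// Made-up sample entries; the second one contains a quote that needs escaping.
var entries = new Dictionary<string, string> {
    ["apple"] = "a fruit",
    ["o'clock"] = "used to state the hour",
};

var sqlLines = entries.Select(e => ToInsert(e.Key, e.Value)).ToList();
foreach (var sql in sqlLines) {
    Console.WriteLine(sql);
}
```

If you load the data through a database driver instead of a generated script, parameterized commands are usually the safer choice than building the SQL text by hand.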

Upvotes: 1
