Brent R.

Reputation: 156

How to search multiple DOCX files for a string within a Word field?

Is there any Windows app that will search for a string of text within fields in a Word (DOCX) document? Apps like Agent Ransack and its big brother FileLocator Pro can find strings in the Word docs but seem incapable of searching within fields.

For example, I would like to be able to find all occurrences of the string "getProposalTranslations" within a collection of Word documents that have fields with syntax like this:

{ AUTOTEXTLIST  \t "<wr:out select='$.shared_quote_info' datasource='getProposalTranslations'/>" }

Note that the string doesn't appear in the text of the document itself, only within a field. As far as I know, a DOCX file is essentially just a zip file, so a tool that can grep within archives might work. I also need to search across hundreds or perhaps thousands of files in many directories, so unzipping the files one by one by hand isn't feasible. I haven't found anything on my own, so I thought I'd ask here. Thanks in advance.

Upvotes: 3

Views: 5584

Answers (1)

Dustin Nieffenegger

Reputation: 638

This shell script should do what you are after; let me know if it doesn't. It needs a Unix-like shell (on Windows, for example Cygwin or WSL) with unzip and xmllint installed. I don't usually write entire scripts because it can hurt the learning process, so I have commented each command so that you can learn from it.

#!/bin/sh

# Scratch folder that holds one unzipped document at a time
TMP="$HOME/tmp/WORDXML"

# Iterate through each file passed to this script.
# "$@" is quoted so that file names containing spaces stay intact.
for FILE in "$@"; do

    # Start each file with an empty scratch folder
    rm -rf "$TMP"
    mkdir -p "$TMP"

    # unzip it into the scratch folder (-d) rather than cd'ing there,
    # so relative paths given on the command line still resolve.
    # > /dev/null 2>&1 discards all output to the terminal
    # (the redirections must be in this order to silence stderr too)
    unzip -o "$FILE" -d "$TMP" > /dev/null 2>&1

    # find all of the xml files, open them in xmllint to make them
    # pretty (discarding errors), then search for and report if found
    find "$TMP" -type f -name '*.xml' \
        -exec xmllint --recover --format {} + 2> /dev/null |
        grep 'getProposalTranslations' && echo " [^ found in file '$FILE']"

done

# remove the temporary folder
rm -rf "$TMP"

Save the script wherever you like, under any name; I'll call it docxfind. Make it executable by running chmod +x docxfind. Then you can run the script like this (assuming your terminal is in the directory where you saved it): ./docxfind filenames...

Upvotes: 4
