Ollie

Reputation: 1946

Converting .docx to .txt in C++

Apologies if this is a duplicate question; I am new to C++ and have had a look around, but I am still stuck!

I have found a bash script that takes a .docx file and outputs the plain text.

unzip -p filename.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

This works great in bash.

Then to use this in my code:

FILE *fp = popen("unzip -p filename.docx word/document.xml | sed -e 's/<[^>]\\{1,\\}>//g; s/[^[:print:]]\\{1,\\}//g'", "r");
char buf[1024];

if (fp == NULL) {
    cout << "Error";
}

while (fgets(buf, 1024, fp)) {
    /* do something with buf */
    cout << buf;
}

fclose(fp);

Nothing is printed as a result of this.

The code works with simple shell commands such as 'ls'.

Any help would be much appreciated!

Upvotes: 0

Views: 1376

Answers (1)

(I assume your program should run on some Linux system, or at least some POSIX one)

You should at least use pclose instead of fclose, and you should check the exit status returned by pclose.

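As commented by Thab, don't forget that the backslash is an escape character inside string literals: the C++ compiler lexes \\ as a single backslash in your string literal constant. So every backslash that sed must see has to be written as \\ in the source, or you could use a C++11 raw string literal and keep the backslashes exactly as you type them in the shell.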
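To make the escaping point concrete, here is a minimal sketch that builds the same command both ways. The function names are made up for illustration; the command text is the one from the question.

```cpp
#include <string>

// Hypothetical helper names; the command is the one from the question.

// Escaped form: every backslash meant for sed is written as "\\".
std::string escaped_command() {
    return "unzip -p filename.docx word/document.xml | "
           "sed -e 's/<[^>]\\{1,\\}>//g; s/[^[:print:]]\\{1,\\}//g'";
}

// Raw string literal (C++11): the text between R"( and )" is taken
// verbatim, so the sed expression is pasted exactly as typed in the shell.
std::string raw_command() {
    return R"(unzip -p filename.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g')";
}
```

Both functions return the same string, so either result can be passed to popen via .c_str().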

(You should certainly check, e.g. with your debugger, which string popen actually receives.)

BTW, perhaps popen failed and you did not catch it. Replace

if (fp == NULL) {
   cout << "Error";
}

(missing std::endl, so the output was not flushed)

with

if (fp == nullptr) {
  // strerror needs <cstring>, errno needs <cerrno>
  clog << "popen failed: " << strerror(errno) << std::endl;
  exit(EXIT_FAILURE);
}

Lastly, I am not sure that this is a good approach to converting a .docx to .txt in batch mode on Linux. I would consider forking a LibreOffice or OpenOffice process to do the job (perhaps libreoffice --headless --cat and some more options). I don't know all the details, so you'll need to read the documentation.

BTW, you should probably write a small shell script to do the conversion, check and test it in the terminal, and call that shell script using popen (hence avoiding a command line with backslashes).
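Calling such a script from C++ could look like the sketch below. The helper name run_and_capture is made up for illustration, and it assumes a POSIX system where popen runs the command through the shell.

```cpp
#include <cstdio>     // popen, pclose (POSIX), fgets
#include <stdexcept>
#include <string>

// Hypothetical helper: runs `command` through the shell via popen(3),
// returns everything the command wrote to stdout, and stores the raw
// pclose(3) status in `status`.
std::string run_and_capture(const std::string& command, int& status) {
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr)
        throw std::runtime_error("popen failed for: " + command);

    std::string output;
    char chunk[4096];
    // fgets stops at a newline or when the buffer is full; appending
    // each chunk reassembles lines of any length.
    while (fgets(chunk, sizeof chunk, pipe) != nullptr)
        output += chunk;

    status = pclose(pipe);  // pclose, not fclose, for a popen stream
    return output;
}
```

With a script saved as, say, ./docx2txt.sh (a hypothetical name), the call would be run_and_capture("./docx2txt.sh filename.docx", status), and no backslashes appear in the C++ source at all.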

Finally, your C++ code is too C-like. I would suggest using getline(3), which handles lines of any length, replacing

while (fgets(buf, 1024, fp)) {
  /* do something with buf */
  cout << buf;
}

with

char* linbuf = nullptr;
size_t linsiz = 0;
ssize_t linlen;
// getline(3) grows linbuf as needed and keeps the trailing newline,
// so no extra std::endl is added per line
while ((linlen = getline(&linbuf, &linsiz, fp)) > 0) {
  cout << std::string(linbuf, linlen);
}
cout << std::flush;
free(linbuf);
linbuf = nullptr;

Of course replace at least your fclose(fp); with

int excod = pclose(fp);
if (excod != 0)
  clog << "command failed, pclose returned " << excod << std::endl;

If you want to know more about the exit code, use the waitpid(2) related macros on excod (e.g. WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG, etc.).
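As a rough sketch of how those macros can be applied to the value returned by pclose (the helper names describe_status and run_and_wait are made up for illustration, and a POSIX system is assumed):

```cpp
#include <cstdio>      // popen, pclose (POSIX), fgets
#include <string>
#include <sys/wait.h>  // WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG

// Hypothetical helper: turns the value returned by pclose(3) into text.
std::string describe_status(int status) {
    if (status == -1)
        return "pclose itself failed";
    if (WIFEXITED(status))
        return "exited with code " + std::to_string(WEXITSTATUS(status));
    if (WIFSIGNALED(status))
        return "killed by signal " + std::to_string(WTERMSIG(status));
    return "unknown status " + std::to_string(status);
}

// Hypothetical helper: runs a command, discards its output, and returns
// the pclose status (or -1 if popen failed).
int run_and_wait(const char* command) {
    FILE* pipe = popen(command, "r");
    if (pipe == nullptr)
        return -1;
    char sink[256];
    while (fgets(sink, sizeof sink, pipe) != nullptr) {
        // drain the pipe so the child is not blocked on a full buffer
    }
    return pclose(pipe);
}
```

In the question's code, describe_status(excod) could be printed to clog instead of the bare number.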

Don't forget to compile with all warnings and debug info (g++ -Wall -Wextra -g) and to use the debugger (gdb), strace(1), and valgrind.

Do take care to flush your buffers (using std::flush, std::endl, fflush(3), etc.) when starting a process with fork(2) (or with system(3) or popen(3), which both fork).
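One way to make the flushing habit hard to forget is a tiny wrapper around popen, sketched below. The name popen_flushed is made up for illustration. The point is that output the parent has already produced gets written out before the child starts, so the two outputs cannot be interleaved in a surprising order.

```cpp
#include <cstdio>    // popen (POSIX), fflush
#include <iostream>

// Hypothetical wrapper: flush pending output, then start the command.
FILE* popen_flushed(const char* command, const char* mode) {
    std::cout << std::flush;  // C++ standard output stream
    std::clog << std::flush;  // buffered C++ log stream
    fflush(nullptr);          // all open C output streams
    return popen(command, mode);
}
```

It is used exactly like popen, and the returned stream must still be closed with pclose.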

Upvotes: 5
