Reputation: 6766
Is there a way to extract text from pptx, ppt, docx, doc and msg files on a Windows machine? I have a few hundred of these files and need a programmatic way to do it. I would prefer Python, but I am open to other suggestions.
I searched online and found some discussions, but they only applied to Linux machines.
Upvotes: 1
Views: 1619
Reputation: 23443
For Word files I tried python-docx; to install it, run pip install python-docx. I had a Word document called example.docx with 4 lines of text in it, and they were extracted correctly, as you can see in the output below.
from docx import Document

d = Document("example.docx")
for par in d.paragraphs:
    print(par.text)
output (the example.docx content):
Titolo
Paragrafo 1 a titolo di esempio
This is an example of text
This is the final part, just 4 rows
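# Join the text of every .docx file in the current folder
import os
from docx import Document

# endswith avoids matching names that merely contain ".docx"
files = [f for f in os.listdir() if f.endswith(".docx")]
text_collector = []
whole_text = ''
for f in files:
    doc = Document(f)
    for par in doc.paragraphs:
        text_collector.append(par.text)

# Put each paragraph on its own line
for text in text_collector:
    whole_text += text + "\n"

print(whole_text)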
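If installing packages is not an option, here is a rough sketch that uses only the standard library instead of python-docx. It relies on the fact that .docx and .pptx files are zip archives containing XML. The function names docx_text and pptx_text are my own, not part of any library. It does not handle the old binary .doc, .ppt or .msg formats, text inside text boxes may show up twice, and slides are ordered by the number in their file name, which usually but not always matches the presentation order.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside Office Open XML files
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def docx_text(path):
    """Return the text of a .docx file, one line per paragraph (hypothetical helper)."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    lines = []
    for par in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of all of them
        lines.append("".join(t.text or "" for t in par.iter(W_NS + "t")))
    return "\n".join(lines)


def pptx_text(path):
    """Return the text of a .pptx file, one line per text paragraph (hypothetical helper)."""
    prefix, suffix = "ppt/slides/slide", ".xml"
    lines = []
    with zipfile.ZipFile(path) as z:
        slides = [n for n in z.namelist()
                  if n.startswith(prefix) and n.endswith(suffix)]
        # Sort slide2 before slide10 by comparing the numbers, not the strings
        slides.sort(key=lambda n: int(n[len(prefix):-len(suffix)]))
        for name in slides:
            root = ET.fromstring(z.read(name))
            for par in root.iter(A_NS + "p"):
                lines.append("".join(t.text or "" for t in par.iter(A_NS + "t")))
    return "\n".join(lines)
```

You would call it as print(docx_text("example.docx")). Unlike the d.paragraphs loop, this also picks up paragraphs that sit inside tables.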
In this code you are asked to choose which files you want to join from a numbered list of the docx files in the folder.
import os
from docx import Document

# endswith avoids matching names that merely contain ".docx"
files = [f for f in os.listdir() if f.endswith(".docx")]

# Show a numbered list of the available files
for n, f in enumerate(files):
    print(n + 1, f)
print()

print("Write the numbers of files you need separated by space")
inp = input("Which files do you want to join? ")
desired = [int(x) for x in inp.split()]

# The list shown to the user is 1-based, so subtract 1 to index
list_to_join = []
for n in desired:
    list_to_join.append(files[n - 1])

text_collector = []
whole_text = ''
for f in list_to_join:
    doc = Document(f)
    for par in doc.paragraphs:
        text_collector.append(par.text)

# Put each paragraph on its own line
for text in text_collector:
    whole_text += text + "\n"

print(whole_text)
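One weak spot in the script above is the input handling: a typo such as "x" raises an error, and 0 silently selects the last file because files[-1] is valid Python. A possible sketch of a more forgiving version follows; parse_selection is a name I made up for illustration, and it simply skips anything that is not a valid number from the list.

```python
def parse_selection(answer, files):
    """Map an answer such as "1 3" to the matching entries of files (1-based)."""
    chosen = []
    for token in answer.split():
        try:
            index = int(token)
        except ValueError:
            continue  # not a number, ignore it
        # Accept only numbers that were actually shown in the list
        if 1 <= index <= len(files):
            chosen.append(files[index - 1])
    return chosen
```

In the script you could then write list_to_join = parse_selection(inp, files) in place of the two loops that build desired and list_to_join.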
Upvotes: 1