Reputation: 346
How can I extract only the text from paragraphs and tables in a Word document that contains objects such as hyperlinks, images, and an attached Excel sheet, using a Python module?
I tried docx2python, but it only works for simple "docx" files, not for ones that contain links or have an Excel file attached inside them.
Upvotes: 0
Views: 656
Reputation: 11
Would this work?
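import docx  # provided by the python-docx package

# FILEPATH is a placeholder for the path to your .docx file
doc = docx.Document(FILEPATH)
text = []
for paragraph in doc.paragraphs:
    # Collect the text of every run in this paragraph
    line = [run.text for run in paragraph.runs]
    if line:
        # If you need a list of paragraphs
        # text.append(line)
        result = ''.join(line)
        # Print the text of this paragraph
        print(result)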
For reading tables in the document, this script may also help: https://github.com/gressa-cpu/Python-Code-to-Share/blob/main/read_word_table.py
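If neither of those copes with the hyperlinks and the embedded sheet, another option is to read the file directly. A .docx is a ZIP archive whose main body is stored in word/document.xml, and visible text sits in w:t elements, including the ones nested inside w:hyperlink. Below is a minimal sketch using only the standard library. The names extract_docx_text and paragraph_text are my own (hypothetical), and it assumes you only need the main body: headers, footers, footnotes and the contents of the embedded Excel file are not read, and text boxes inside a paragraph may show up as part of that paragraph's text.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_text(p):
    """Join the text of a w:p element, including runs nested in w:hyperlink."""
    parts = []
    for node in p.iter():
        if node.tag == W + "t":
            parts.append(node.text or "")
        elif node.tag == W + "tab":
            parts.append("\t")
    return "".join(parts)


def extract_docx_text(path_or_file):
    """Return (paragraphs, tables) from the main document body.

    paragraphs: list of non-empty paragraph strings (top-level paragraphs only)
    tables: list of tables, each a list of rows, each a list of cell strings
    """
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs, tables = [], []
    for child in root.find(W + "body"):
        if child.tag == W + "p":
            text = paragraph_text(child)
            if text:  # skips paragraphs that hold only an image or object
                paragraphs.append(text)
        elif child.tag == W + "tbl":
            rows = []
            for tr in child.findall(W + "tr"):
                # A cell may hold several paragraphs; join them with newlines
                cells = [
                    "\n".join(paragraph_text(p) for p in tc.iter(W + "p"))
                    for tc in tr.findall(W + "tc")
                ]
                rows.append(cells)
            tables.append(rows)
    return paragraphs, tables
```

You would call it as paragraphs, tables = extract_docx_text(FILEPATH). Images and embedded objects carry no w:t text, so they are simply skipped rather than raising an error.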
Upvotes: 1