Reputation: 111
Objective: I need to extract text from a table (with column names such as Name, Address, Contact Number, Email, etc.) in .ppt files. So far I have tried this approach:
I converted the .ppt file to PDF and then extracted the data from the PDF using PDFMiner. The extracted text is not separated by any delimiter, which makes it very difficult to tell names apart from the other fields in the table.
Probable solution I am working on:
I am stuck at the first step: converting the file from .ppt to .pptx. I couldn't find any way to convert the .ppt file format to the .pptx format in Python.
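For context, once a .pptx file exists the table cells can be read one by one instead of going through PDF, which keeps the field boundaries. A minimal sketch, assuming the third-party python-pptx package is installed (pip install python-pptx); the function names table_to_rows and extract_tables are hypothetical:

```python
def table_to_rows(table):
    """Return a table's cell text as a list of rows (lists of strings)."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def extract_tables(pptx_path):
    """Yield (slide_number, rows) for every table found in a .pptx file."""
    # Imported here so table_to_rows stays usable without python-pptx.
    from pptx import Presentation

    prs = Presentation(pptx_path)
    for slide_number, slide in enumerate(prs.slides, start=1):
        for shape in slide.shapes:
            # has_table is True only for shapes that actually hold a table.
            if shape.has_table:
                yield slide_number, table_to_rows(shape.table)
```

Each yielded rows list keeps one inner list per table row, so the first inner list would normally be the header (Name, Address, and so on) and the rest the records.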
Upvotes: 8
Views: 6553
Reputation: 14809
Most/all of the other proposed answers assume that PowerPoint is installed, then automate it using Python; from the comments, it seems there are problems with some/all of them.
Since PowerPoint is assumed, and since it has VBA built in, why not use that?
I've posted some code here that will do something to every file in a given folder: https://www.rdpslides.com/pptfaq/FAQ00536_Batch-_Do_something_to_every_file_in_a_folder.htm
For each file found it calls a routine called MyMacro. Change it to call SaveAsPPTX instead and use this:
Sub SaveAsPPTX(sOldName As String)
    Dim oPres As Presentation
    Dim sNewName As String

    ' Open the file passed in as sOldName (read-only, without a window)
    Set oPres = Presentations.Open(sOldName, msoTrue, , msoFalse)
    ' Note: opening the presentation windowlessly
    ' saves vast amounts of time

    ' Strip off the extension (everything from the last "." onward)
    sNewName = Left$(sOldName, InStrRev(sOldName, ".") - 1)
    ' Add .PPTX extension
    sNewName = sNewName & ".PPTX"

    ' Save to new name and close the file
    oPres.SaveAs sNewName, ppSaveAsOpenXMLPresentation
    oPres.Close
End Sub
Upvotes: 0
Reputation: 41
I wrote this code; I hope it works for you:
import win32com.client
PptApp = win32com.client.Dispatch("Powerpoint.Application")
PptApp.Visible = True
PPtPresentation = PptApp.Presentations.Open(r'D:\ppt\sample.ppt')
PPtPresentation.SaveAs(r'D:\ppt\final.pptx', 24)
PPtPresentation.Close()  # the COM method is Close; lowercase can fail with early-bound (makepy) wrappers
PptApp.Quit()
Edit: this also works on Python 3.11.9 after running pip install pywin32.
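A slightly more defensive variant of the same idea, as a sketch only: it passes absolute paths (PowerPoint may resolve relative paths against its own working directory rather than the script's) and closes PowerPoint even if saving fails. The names pptx_target and convert_ppt_to_pptx are hypothetical; ReadOnly and WithWindow are parameters of Presentations.Open in the PowerPoint object model, and this has to run on Windows with PowerPoint and pywin32 installed.

```python
from pathlib import Path

PP_SAVE_AS_OPEN_XML_PRESENTATION = 24  # value of ppSaveAsOpenXMLPresentation


def pptx_target(ppt_path):
    """Return the absolute .pptx path next to the given .ppt file."""
    return Path(ppt_path).resolve().with_suffix(".pptx")


def convert_ppt_to_pptx(ppt_path):
    """Convert one .ppt file by driving PowerPoint over COM (Windows only)."""
    # Imported lazily because pywin32 exists only on Windows.
    import win32com.client

    source = Path(ppt_path).resolve()
    target = pptx_target(source)
    app = win32com.client.Dispatch("PowerPoint.Application")
    try:
        # Open read-only and without a window; the source file is untouched.
        presentation = app.Presentations.Open(
            str(source), ReadOnly=True, WithWindow=False
        )
        try:
            presentation.SaveAs(str(target), PP_SAVE_AS_OPEN_XML_PRESENTATION)
        finally:
            presentation.Close()
    finally:
        app.Quit()
    return target
```

Calling convert_ppt_to_pptx(r'D:\ppt\sample.ppt') would write sample.pptx into the same folder and return its path.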
Upvotes: 3
Reputation: 26
Works perfectly on Anaconda 3 + Jupyter Notebook:
from glob import glob
import win32com.client

# Placeholder folder; '**' together with recursive=True also searches subfolders
paths = glob('C:\\yourfilePath\\**\\*.ppt', recursive=True)

def save_as_pptx(path):
    PptApp = win32com.client.Dispatch("Powerpoint.Application")
    PptApp.Visible = True
    PPtPresentation = PptApp.Presentations.Open(path)
    # 24 = ppSaveAsOpenXMLPresentation; 'sample.ppt' is saved as 'sample.pptx'
    PPtPresentation.SaveAs(path + 'x', 24)
    PPtPresentation.Close()
    PptApp.Quit()

for path in paths:
    print(path)
    save_as_pptx(path)
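Starting and quitting PowerPoint for every file is slow on large folders. A possible variation, sketched under the same assumptions (Windows, PowerPoint, pywin32), starts the application once and reuses it; find_ppt_files and convert_folder are hypothetical names:

```python
from pathlib import Path


def find_ppt_files(root):
    """Return every legacy .ppt file under root, including subfolders."""
    # Compare suffixes explicitly so .pptx is skipped and .PPT is accepted.
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".ppt"
    )


def convert_folder(root):
    """Convert every .ppt under root, starting PowerPoint only once."""
    import win32com.client  # pywin32, Windows only

    files = find_ppt_files(root)
    if not files:
        return []
    app = win32com.client.Dispatch("PowerPoint.Application")
    converted = []
    try:
        for source in files:
            source = source.resolve()
            target = source.with_suffix(".pptx")
            presentation = app.Presentations.Open(str(source), WithWindow=False)
            try:
                presentation.SaveAs(str(target), 24)  # ppSaveAsOpenXMLPresentation
            finally:
                presentation.Close()
            converted.append(target)
    finally:
        app.Quit()
    return converted
```

The file search is kept separate from the COM calls so it can be checked on any machine before PowerPoint is involved.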
Upvotes: 0
Reputation: 139
import os
os.system("libreoffice --headless --invisible --convert-to pptx *.ppt")
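The same call can be made with subprocess and an argument list, which avoids shell quoting problems when file names contain spaces. This is only a sketch: build_convert_command and convert_with_libreoffice are hypothetical names, and it assumes a soffice or libreoffice executable is on PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(ppt_files, out_dir, binary="soffice"):
    """Build the LibreOffice argument list for a headless .ppt to .pptx run."""
    return [binary, "--headless", "--convert-to", "pptx",
            "--outdir", str(out_dir), *map(str, ppt_files)]


def convert_with_libreoffice(folder, out_dir):
    """Convert every .ppt in folder; raises if LibreOffice exits with an error."""
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise FileNotFoundError("LibreOffice executable not found on PATH")
    files = sorted(Path(folder).glob("*.ppt"))
    if files:
        # check=True raises CalledProcessError on a nonzero exit status.
        subprocess.run(build_convert_command(files, out_dir, binary), check=True)
    return files
```

Because this route needs no PowerPoint installation, it also works on Linux and macOS.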
Upvotes: 0
Reputation: 55
For macOS Homebrew users: install Apache Tika (brew install tika).
The command-line interface works like this:
tika --text something.ppt > something.txt
To use it inside a Python script:
import os
os.system("tika --text temp.ppt > temp.txt")
This extracts the text straight from the .ppt file, without converting it first; it is the only solution I have found so far.
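If the text is needed inside the script rather than in a file, the output can be captured directly. A minimal sketch, assuming the tika command from the Homebrew install is on PATH; tika_text is a hypothetical name, and the run parameter exists only so the function can be exercised without Tika installed:

```python
import subprocess


def tika_text(path, run=subprocess.run):
    """Return the plain text that Tika extracts from a file."""
    result = run(
        ["tika", "--text", str(path)],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a nonzero exit
    )
    return result.stdout
```

For example, text = tika_text("temp.ppt") would hold the same content that the shell redirect above writes to temp.txt.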
Upvotes: 0