Ranjan

Reputation: 25

How to extract the language of a pdf document

I am trying to extract the language of an arbitrary PDF document and set it in a CMS using Python. I am trying to read it from the /Lang attribute; here is the code sample:

pdfFileLang = findInDict('/Lang',pdfFile.resolvedObjects())



def findInDict(needle,indirectObjectDict):
    """ Returns the PDF Language """
    haystack = indirectObjectDict[0]
    LOG('pypdfutils.py getPdfLanguage key haystack',INFO,str(haystack))
    for key in haystack.keys():
        LOG('pypdfutils.py getPdfLanguage key',INFO,str(key))
        try:
             value = haystack[key]
             LOG('pypdfutils.py getPdfLanguage value',INFO,str(value))
             if key == needle:
                 return value
             else:
                 LOG('pypdfutils.py getPdfLanguage value1',INFO,str(value))
             internalDict = value.keys()
             LOG('pypdfutils.py getPdfLanguage key Dict',INFO,str(internalDict))
             if type(value) == types.DictType:
                 internalDict = value.keys()                 
             else:
                 LOG('pypdfutils.py getPdfLanguage value2',INFO,str(value))
                 for internalkey in internalDict.keys():
                     internalvalue = internalDict[internalkey]
                     LOG('pypdfutils.py getPdfLanguage key internalvalue',INFO,str(internalvalue))
                     if type(internalvalue) == types.DictType and internalvalue.has_key(needle):
                         return internalvalue[needle]                                  
        except Exception,e:
            LOG('pypdfutils.py getPdfLanguage',INFO,str(e))
            continue

But when I look at the logs, I find no "/Lang" attribute in the dictionary.

Upvotes: 1

Views: 2390

Answers (2)

Oleg Buckridge

Reputation: 71

It looks like you tried to search for the 'Lang' key through all the dictionaries in your PDF file.

To get the language information from a PDF file, you need to check the 'Lang' entry in the catalog. However, the existence of this entry depends on the software used to create the PDF file, and most PDF files do not have it.

I do not know Python, but I believe the PDF library you are using gives you access to the trailer dictionary or the catalog (root) dictionary. If you have access to the trailer dictionary, get its 'Root' value. This is an indirect reference to the catalog (root) dictionary. Resolve this reference to get the catalog dictionary itself. Reading the /Lang value from the catalog dictionary then gives you the attribute, if it is present.

Try the following:

catalog = pdfFile.trailer['/Root'].getObject()
if '/Lang' in catalog:  # /Lang is optional, so check the catalog before reading it
    lang = catalog['/Lang'].getObject()

Please note that I am not a Python programmer and the code snippet above is my first Python code, so I am not sure it works. :-)

Please refer to the pyPdf reference at http://sourcecodebrowser.com/python-pypdf/1.10/classpy_pdf_1_1pdf_1_1_pdf_file_reader.html#a92be75503c895367083a846b3060e632
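For readers on a current Python 3 setup, the same lookup can be sketched with the newer pypdf package. This is only a sketch under assumptions: pypdf is assumed to be installed, and the helper names catalog_language and pdf_language are made up for illustration. Treating an empty value like a missing one is also a choice of this sketch.

```python
def catalog_language(catalog):
    """Return the document-level /Lang value as a string, or None if absent.

    `catalog` is any dict-like catalog object (hypothetical helper).
    """
    # /Lang is optional, so test for the key instead of indexing blindly.
    if "/Lang" not in catalog:
        return None
    # Assumption of this sketch: a blank value is handled like a missing entry.
    text = str(catalog["/Lang"]).strip()
    return text or None


def pdf_language(path):
    """Open a PDF with pypdf (assumed installed) and read /Lang from its catalog."""
    # Imported lazily so catalog_language stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # The trailer's /Root entry is the document catalog; pypdf resolves the
    # indirect reference when the entry is accessed by key.
    catalog = reader.trailer["/Root"]
    return catalog_language(catalog)
```

Calling pdf_language("some.pdf") should then return something like a language tag, or None when the producer never wrote the entry, which is the common case described above.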

Upvotes: 2

David van Driessche

Reputation: 7046

As explained in the PDF specification (http://www.adobe.com/devnet/pdf/pdf_reference.html), there is a "/Lang" key in the Document Catalog. In my version of the PDF specification this is explained in section 7.7.2.

This language key defines the language assumed for the complete document, with the exception of those parts that are marked differently.

So, two caveats:

1) This "/Lang" key is optional. If it's not there, the PDF specification says the language is undefined.

2) This "/Lang" key may be overridden by other elements in the file. So the entire document may be English, but specific sentences on page 101 may redefine the language as French, for example.

In your case, your algorithm should first try to find the overall document language as defined above. If that's not there, it's up to you what to do. You could search the complete document for "/Lang" keys to see if you find any others, but if you find several, you'll have to decide what that means...
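The fallback described above could be sketched roughly as follows. This is a sketch under assumptions, not a definitive implementation: it works on plain nested dicts and lists, so with a real PDF library you would first have to resolve indirect references into such a structure. The function names are hypothetical, and picking the most frequent value when several "/Lang" keys disagree is just one possible policy.

```python
from collections import Counter


def find_lang_values(node, _seen=None):
    """Collect every /Lang string reachable from node (nested dicts and lists)."""
    if _seen is None:
        _seen = set()
    found = []
    # PDF object graphs contain cycles (for example /Parent links),
    # so remember which containers were already visited.
    if isinstance(node, (dict, list)):
        if id(node) in _seen:
            return found
        _seen.add(id(node))
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "/Lang" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(find_lang_values(value, _seen))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_lang_values(item, _seen))
    return found


def guess_language(catalog):
    """Prefer the catalog's own /Lang; otherwise fall back to a document-wide search."""
    top_level = catalog.get("/Lang")
    if isinstance(top_level, str) and top_level:
        return top_level
    values = find_lang_values(catalog)
    if not values:
        return None  # language is undefined for this document
    # Assumed policy: the most frequent tag wins when several are present.
    return Counter(values).most_common(1)[0][0]
```

A stricter alternative would be to return None whenever the collected values disagree, and leave the decision to whoever maintains the CMS record.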

Upvotes: 1
