DavidK

Reputation: 2564

Elasticsearch - Extracting PDF content and encoding with base64

I want to extract the content of a PDF file and search within that content using Elasticsearch.

I installed elasticsearch/elasticsearch-mapper-attachments/2.6.0.

I created a new index named "docs".

I created a file named "tmp.json" with the following content:

{"title": "file.pdf", "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="}

I then executed the following:

curl -X PUT "http://localhost:9200/docs/attachment/_mapping" -d '{
  "attachment": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "title": {"store": "yes"},
          "file": {
            "type": "string",
            "term_vector": "with_positions_offsets",
            "store": "yes"
          }
        }
      }
    }
  }
}'

and then the following:

curl -X POST "http://localhost:9200/docs/attachment" -d @tmp.json
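
A mapping body is easy to mistype (for example, a key opened with a single quote and closed with a double quote), and it can be parsed locally before it is sent with curl. The following is only a minimal sketch: `check_json_body` is a hypothetical helper name, and the two sample bodies are shortened illustrations, not the full mapping.

```python
import json


def check_json_body(body: str) -> dict:
    """Parse a request body locally; raises ValueError if it is not valid JSON."""
    return json.loads(body)


# Shortened, illustrative bodies (assumptions, not the full mapping).
good = '{"attachment": {"properties": {"file": {"type": "attachment"}}}}'
# Same body, but the key starts with a single quote and ends with a double quote.
bad = '{"attachment": {"properties": {\'file": {"type": "attachment"}}}}'

parsed = check_json_body(good)

try:
    check_json_body(bad)
    bad_is_valid = True
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError.
    bad_is_valid = False
```

This only checks JSON syntax; it says nothing about whether Elasticsearch accepts the mapping itself.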

The problem is that the content is stored exactly as it appears in the file, that is, as the raw base64 string.

I was expecting the content to be decoded, like so:

base64.b64decode("IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg==")

That gives:

b'"God Save the Queen" (alternatively "God Save the King"'

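Here is how I encode the file in base64:

import json, base64

fname = 'file.pdf'  # name of the PDF to encode
# Read the PDF as bytes and base64-encode it into an ASCII string
with open(fname, 'rb') as pdf:
    file64 = base64.b64encode(pdf.read()).decode('ascii')
# Write the JSON document that is later posted with curl
data = {"file": file64, "title": fname}
with open('tmp.json', 'w') as f:
    json.dump(data, f)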
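
To rule out the encoding step as the cause, the payload can be round-tripped locally. This is a minimal sketch, not a definitive implementation: `build_payload` and `write_payload` are hypothetical helper names, and the sample bytes stand in for a real PDF.

```python
import base64
import json
import os
import tempfile


def build_payload(path: str, title: str) -> dict:
    """Read a file as bytes and return a dict holding its base64 text."""
    with open(path, "rb") as fh:
        encoded = base64.b64encode(fh.read()).decode("ascii")
    return {"title": title, "file": encoded}


def write_payload(payload: dict, out_path: str) -> None:
    """Serialize the payload as JSON, ready to be posted with curl."""
    with open(out_path, "w") as out:
        json.dump(payload, out)


# Demo on a throwaway file (placeholder bytes, not a real PDF).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.bin")
    with open(src, "wb") as fh:
        fh.write(b"%PDF-1.4 example bytes")
    payload = build_payload(src, "sample.bin")
    out_path = os.path.join(tmp, "tmp.json")
    write_payload(payload, out_path)
    with open(out_path) as fh:
        reloaded = json.load(fh)

# Decoding the stored string should give back the original bytes.
decoded = base64.b64decode(reloaded["file"])
```

If `decoded` matches the source bytes, the base64 step is sound and the issue lies in how the field is mapped or indexed.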

I would like to see the extracted content in Kibana, but for now I only see the base64 data.

Upvotes: 2

Views: 1983

Answers (1)

DavidK

Reputation: 2564

This did not work:

curl -X PUT "http://localhost:9200/docs/attachment/_mapping" -d '{
  "attachment": {
    "properties": {
      "content": {
        "type": "attachment",
        "fields": {
          "title": {"store": "yes"},
          "content": {
            "type": "string",
            "term_vector": "with_positions_offsets",
            "store": "yes"
          }
        }
      }
    }
  }
}'

This worked, and I can now see the content of the PDF in Kibana:

curl -X PUT "http://localhost:9200/docs" -d '{
                              "mappings" : {
                                "attachment" : {
                                  "properties" : {
                                    "content" : {
                                      "type" : "attachment",
                                      "fields" : {
                                        "content"  : { "store" : "yes" },
                                        "author"   : { "store" : "yes" },
                                        "title"    : { "store" : "yes"},
                                        "date"     : { "store" : "yes" },
                                        "keywords" : { "store" : "yes", "analyzer" : "keyword" },
                                        "name"    : { "store" : "yes" },
                                        "content_length" : { "store" : "yes" },
                                        "content_type" : { "store" : "yes" }
                                      }
                                    }
                                  }
                                }
                              }
                            }'
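
The same index-creation body can be built in Python instead of being typed by hand. This is a rough sketch under stated assumptions: `build_index_body` and `build_document` are hypothetical helpers, and it assumes (the answer does not say so explicitly) that the indexed document carries its base64 data under the same name as the attachment-typed property, here "content".

```python
import json

# Sub-fields stored by the working mapping, apart from "keywords".
STORED_FIELDS = [
    "content", "author", "title", "date",
    "name", "content_length", "content_type",
]


def build_index_body(doc_type: str = "attachment", field: str = "content") -> dict:
    """Build the body for creating the index together with its mapping."""
    fields = {name: {"store": "yes"} for name in STORED_FIELDS}
    fields["keywords"] = {"store": "yes", "analyzer": "keyword"}
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "attachment", "fields": fields}
                }
            }
        }
    }


def build_document(b64_data: str, field: str = "content") -> dict:
    """Assumption: the base64 string goes under the attachment field's name."""
    return {field: b64_data}


index_body = json.dumps(build_index_body())
doc_body = json.dumps(build_document(
    "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
))
```

The two JSON strings could then be written to files and passed to curl with `-d @file`, as in the question.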

Upvotes: 1
