Frank Mehlhop

How to ingest a PDF into Elasticsearch

I installed the Ingest Attachment Processor plugin in Elasticsearch.

Then I created a very simple PDF file.

I am trying to ingest the content of this file into Elasticsearch (see the commands below).

However, searching for a word from the file returns no hits (see the third response, "3 answer", in the commands below).

What is wrong, or which step is missing?

Do I need to add a pipeline?

Is the PUT of the PDF correct, and do I need to put the PDF content into the content field of the PUT command?

Console commands:

1 console:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}

1 answer:

{
  "acknowledged" : true
}

2 console:

PUT my_index/_doc/001?pipeline=attachment
{
  "filename": "C:\\ELK-Stack\\Test.pdf",
  "data": "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBnZXRlc3RldC4gRGFzIGlzdCBkZXIgVGVzdA==",
  "attachment": {
    "content_type": "application/rtf",
    "language": "ro",
    "content": "Test Test Dokument umgewandelt von word zu pdf. Hier wird getestet. Das ist der Test."
  },
  "title": "Quick"
}

2 answer:

{
  "_index" : "my_index",
  "_id" : "001",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
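
On the question of what the PUT body needs: in the output of command 4 further down, the attachment object sent in this request has been overwritten by the pipeline (content_type and language differ from what was sent), so only the Base64-encoded file in data appears to be required. A minimal Python sketch of building such a body follows. build_attachment_doc is a hypothetical helper name, not part of any Elasticsearch client, and the placeholder bytes stand in for a real file.

```python
import base64
import json


def build_attachment_doc(filename: str, raw_bytes: bytes) -> dict:
    """Return a document body whose "data" field holds the file as Base64.

    Hypothetical helper: the field names "filename" and "data" mirror the
    request in the question; "data" must match the pipeline's "field" setting.
    """
    return {
        "filename": filename,
        # The attachment processor expects the raw file bytes, Base64-encoded.
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }


if __name__ == "__main__":
    # Placeholder bytes standing in for a real file. For an actual PDF,
    # read the bytes with open(path, "rb").read() instead.
    body = build_attachment_doc("Test.pdf", b"%PDF-1.4 placeholder")
    print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of the PUT request; the attachment object is left out because the pipeline generates it.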

3 console:

GET /my_index/_search 
{
  "query": {
    "match": {
      "content": "Test"
    }
  }
}

3 answer:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

4 console:

GET /_search
{
    "query": {
        "match_all": {}
    }
}

4 answer:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "001",
        "_score" : 1.0,
        "_source" : {
          "filename" : """C:\ELK-Stack\Test.pdf""",
          "data" : "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBnZXRlc3RldC4gRGFzIGlzdCBkZXIgVGVzdA==",
          "attachment" : {
            "content_type" : "text/plain; charset=windows-1252",
            "language" : "et",
            "content" : """Test
Test Dokument umgewandelt von wo
Hier wird getestet. Das ist der Test""",
            "content_length" : 77
          },
          "title" : "Quick"
        }
      }
    ]
  }
}
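
One thing worth checking in the output above: the reported content_type is text/plain, not application/pdf. Decoding the data string from command 2 suggests why. The sketch below uses only the Python standard library and the Base64 string copied from the question; the prefix check assumes the usual PDF signature, %PDF-, at the start of the file.

```python
import base64

# Base64 string copied verbatim from the "data" field in command 2.
DATA = "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBnZXRlc3RldC4gRGFzIGlzdCBkZXIgVGVzdA=="

raw = base64.b64decode(DATA)

# PDF files start with the signature "%PDF-"; plain text does not.
is_pdf = raw.startswith(b"%PDF-")

print("looks like a PDF:", is_pdf)
# Decode with the charset reported in the indexed document.
print(raw.decode("cp1252"))
```

If the decoded bytes do not start with %PDF-, the string was probably produced from copied text rather than from the PDF file itself, which would match the truncated "von wo" in the indexed content.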


Answers (1)

Frank Mehlhop

Thanks to LeBigCat, I found the solution.

The attachment processor stores the extracted text inside the attachment object, so the query has to use the full path to the field:

"attachment.content": "Test"

(instead of "content": "Test")

GET /my_index/_search 
{
  "query": {
    "match": {
      "attachment.content": "Test"
    }
  }
}
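
Why the short name fails can be illustrated by flattening the stored _source into dotted field paths, which is roughly how Elasticsearch addresses fields inside objects. The following Python sketch only illustrates that naming and does not talk to a cluster. flatten_paths is a hypothetical helper, and the sample document is abbreviated from the output in the question.

```python
def flatten_paths(doc: dict, prefix: str = "") -> dict:
    """Map every leaf value of a nested dict to its dotted field path."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into objects, extending the dotted path.
            flat.update(flatten_paths(value, path))
        else:
            flat[path] = value
    return flat


# Abbreviated copy of the indexed _source from the question.
source = {
    "filename": "C:\\ELK-Stack\\Test.pdf",
    "attachment": {
        "content_type": "text/plain; charset=windows-1252",
        "language": "et",
        "content": "Test\nTest Dokument umgewandelt von wo\nHier wird getestet. Das ist der Test",
        "content_length": 77,
    },
    "title": "Quick",
}

fields = flatten_paths(source)
print(sorted(fields))
print("content" in fields)             # name used in the failing query
print("attachment.content" in fields)  # name used in the working query
```

There is no top-level field called content in the document, only attachment.content, which is why the first query returned zero hits.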
