YNR

Reputation: 875

How to split a field into words with an ingest pipeline in Kibana

I have created an ingest pipeline as below to split a field into words:

POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "String cutting processing",
        "processors": [
            {
                "split": {
                    "field": "foo",
                    "separator": "|"
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "foo": "apple|time"
            }
        }
    ]
}

but it splits the field into characters:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "foo" : [
            "a",
            "p",
            "p",
            "l",
            "e",
            "|",
            "t",
            "i",
            "m",
            "e"
          ]
        }
      }
    }
  ]
}

If I replace the separator with a comma, the same pipeline splits the field into words:

POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "String cutting processing",
        "processors": [
            {
                "split": {
                    "field": "foo",
                    "separator": ","
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "foo": "apple,time"
            }
        }
    ]
}

then the output would be:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "foo" : [
            "apple",
            "time"
          ]
        }
      }
    }
  ]
}

How can I split the field into words when the separator is "|"? Also, how can I apply this ingest pipeline to an existing index? I tried this solution, but it doesn't work for me.

Edit

Here is the whole pipeline, with a sample document, which should assign the two parts to two separate fields:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": """combined fields are text that contain  "|" to separate two fields""",
    "processors": [
      {
        "split": {
          "field": "dv_m",
          "separator": "|",
          "target_field": "dv_m_splited"
        }
      },
      {
        "set": {
          "field": "dv_metric_prod",
          "value": "{{dv_m_splited.1}}",
          "override": false
        }
      },
      {
        "set": {
          "field": "dv_metric_section",
          "value": "{{dv_m_splited.2}}",
          "override": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dv_m": "amaze_inc|Understanding"
      }
    }
  ]
}

That generates this response:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "dv_metric_prod" : "m",
          "dv_m_splited" : [
            "a",
            "m",
            "a",
            "z",
            "e",
            "_",
            "i",
            "n",
            "c",
            "|",
            "U",
            "n",
            "d",
            "e",
            "r",
            "s",
            "t",
            "a",
            "n",
            "d",
            "i",
            "n",
            "g"
          ],
          "dv_metric_section" : "a",
          "dv_m" : "amaze_inc|Understanding"
        },
        "_ingest" : {
          "timestamp" : "2021-08-02T08:33:58.2234143Z"
        }
      }
    }
  ]
}

If I set "separator": "\\|", I get this error:

{
  "docs" : [
    {
      "error" : {
        "root_cause" : [
          {
            "type" : "general_script_exception",
            "reason" : "Error running com.github.mustachejava.codes.DefaultMustache@776f8239"
          }
        ],
        "type" : "general_script_exception",
        "reason" : "Error running com.github.mustachejava.codes.DefaultMustache@776f8239",
        "caused_by" : {
          "type" : "mustache_exception",
          "reason" : "Failed to get value for dv_m_splited.2 @[query-template:1]",
          "caused_by" : {
            "type" : "mustache_exception",
            "reason" : "2 @[query-template:1]",
            "caused_by" : {
              "type" : "index_out_of_bounds_exception",
              "reason" : "2"
            }
          }
        }
      }
    }
  ]
}

Upvotes: 0

Views: 2148

Answers (1)

Răzvan

Reputation: 991

The solution is fairly simple: just escape your separator.

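The separator option of the split processor is a regular expression, and | is the regex alternation operator. An unescaped | therefore matches the empty string at every position, which is why the field is split into single characters. To match a literal pipe, escape it as \|.

Because the request body is JSON, the backslash itself must be escaped as well, which gives "\\|".

So your code only lacks the double escaping: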

POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "String cutting processing",
        "processors": [
            {
                "split": {
                    "field": "foo",
                    "separator": "\\|"
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "foo": "apple|time"
            }
        }
    ]
}
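
To see why the unescaped separator splits into characters, here is a minimal Java sketch. It assumes the split processor behaves like Java's String.split (the separator is a Java regex); the class and method names are made up for the illustration. Note that a Java string literal needs the same doubled backslash as the JSON body.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: assumes the split processor follows String.split semantics.
class SplitSeparatorDemo {
    // Split the input on the given regular expression.
    static List<String> split(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        // "|" is regex alternation of two empty patterns: it matches everywhere.
        System.out.println(split("apple|time", "|"));   // [a, p, p, l, e, |, t, i, m, e]
        // "\\|" in source is the regex \| and matches a literal pipe.
        System.out.println(split("apple|time", "\\|")); // [apple, time]
        // A comma has no special meaning in a regex, so no escaping is needed.
        System.out.println(split("apple,time", ","));   // [apple, time]
    }
}
```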

UPDATE

Either you did not mention it or I missed it, but you want to assign the values to two separate fields.

In this case, you should use dissect instead of split. It is shorter, simpler, and cleaner. See the documentation here.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": """combined fields are text that contain  "|" to separate two fields""",
    "processors": [
      {
        "dissect": {
          "field": "dv_m",
          "pattern": "%{dv_metric_prod}|%{dv_metric_section}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dv_m": "amaze_inc|Understanding"
      }
    }
  ]
}

Result

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "dv_metric_prod" : "amaze_inc",
          "dv_metric_section" : "Understanding",
          "dv_m" : "amaze_inc|Understanding"
        },
        "_ingest" : {
          "timestamp" : "2021-08-18T07:39:12.84910326Z"
        }
      }
    }
  ]
}
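
The dissect pattern above needs no escaping because dissect matches the delimiter literally rather than as a regular expression. The Java sketch below only mimics that one pattern to show the idea. It is not how Elasticsearch implements dissect, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a hand-written stand-in for the pattern
// "%{dv_metric_prod}|%{dv_metric_section}", not the real dissect processor.
class DissectSketch {
    static Map<String, String> dissect(String value) {
        // Find the first literal '|'; no regex is involved.
        int pipe = value.indexOf('|');
        if (pipe < 0) {
            throw new IllegalArgumentException("no '|' delimiter in: " + value);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("dv_metric_prod", value.substring(0, pipe));
        fields.put("dv_metric_section", value.substring(pipe + 1));
        return fields;
    }

    public static void main(String[] args) {
        // prints {dv_metric_prod=amaze_inc, dv_metric_section=Understanding}
        System.out.println(dissect("amaze_inc|Understanding"));
    }
}
```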

ADDENDUM

If using split instead of dissect

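Your array indices are wrong. Array indices start at 0 and the split produces only two elements, so {{dv_m_splited.2}} does not exist. That is what causes the index_out_of_bounds_exception once the separator is escaped correctly. Use {{dv_m_splited.0}} and {{dv_m_splited.1}} instead.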

This is the correct pipeline when using the split processor.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": """combined fields are text that contain  "|" to separate two fields""",
    "processors": [
      {
        "split": {
          "field": "dv_m",
          "separator": "\\|",
          "target_field": "dv_m_splited"
        }
      },
      {
        "set": {
          "field": "dv_metric_prod",
          "value": "{{dv_m_splited.0}}",
          "override": false
        }
      },
      {
        "set": {
          "field": "dv_metric_section",
          "value": "{{dv_m_splited.1}}",
          "override": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dv_m": "amaze_inc|Understanding"
      }
    }
  ]
}
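
The index problem can be reproduced outside Elasticsearch. This is a small Java sketch, again assuming the split processor behaves like String.split; the names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: shows why index 2 fails after a correct split.
class SplitIndexDemo {
    // Split on a literal pipe (the regex \|).
    static List<String> splitOnPipe(String value) {
        return Arrays.asList(value.split("\\|"));
    }

    public static void main(String[] args) {
        List<String> parts = splitOnPipe("amaze_inc|Understanding");
        System.out.println(parts.size()); // 2
        System.out.println(parts.get(0)); // amaze_inc
        System.out.println(parts.get(1)); // Understanding
        try {
            parts.get(2); // only indices 0 and 1 exist
        } catch (IndexOutOfBoundsException e) {
            System.out.println("index 2 is out of bounds");
        }
    }
}
```

With the unescaped separator the array had 23 single-character elements, so indices 1 and 2 existed and the set processors silently picked up "m" and "a" instead of failing.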

Upvotes: 1
