Sathishkumar Jayaraj
Sathishkumar Jayaraj

Reputation: 646

How to index and query Nested documents in the Elasticsearch

I have 1 million users in a Postgres table. It has around 15 columns which are of the different datatype (like integer, array of string, string). Currently using normal SQL query to filter the data as per my requirement.

I also have an "N" number of projects (max 5 projects) under each user. I have indexed these projects in the elasticsearch and doing the fuzzy search. Currently, for each project (text file) I have a created a document in the elasticsearch.

Both the systems are working fine.

Now my need is to query the data on both the systems. Ex: I want all the records which have the keyword java (on elasticsearch) and with experience of more than 10 years (available in Postgres).

Since the user's count will be increasing drastically, I have moved all the Postgres data into the elasticsearch.

There is a chance of applying filters only on the fields related to the user (except project related fields).

Now I need to created nest projects for the corresponding users. I tried parent-child types and didn't work for me.

Could anyone help me with the following things?

Upvotes: 0

Views: 259

Answers (1)

Kevin Quinzel
Kevin Quinzel

Reputation: 1428

By your description, we can tell that the "base document" is all based on users.

Now, regarding your questions:

  1. Based on what I said before, you can add all the projects associated to each user as an array. Like this:

{
    "user_name": "John W.",
    ..., #More information from this user
    "projects": [
        {
            "project_name": "project_1",
            "role": "Dev",
            "category": "Business Intelligence",                
        },
        {
            "project_name": "project_3",
            "role": "QA",
            "category": "Machine Learning",
        }
    ]
},
{
    "user_name": "Diana K.",
    ..., #More information from this user
    "projects": [
        {
            "project_name": "project_1"
            "role": "Project Leader",
            "category": "Business Intelligence",

        },
        {
            "project_name": "project_4",
            "role": "DataBase Manager",
            "category": "Mobile Devices",
        },
        {
            "project_name": "project_5",
            "role": "Project Manager",
            "category": "Web services",
        }
    ]
}

This structure is with the goal of adding all the info of the user to each document, doesn't matter if the info is repeated. Doing this will allow you to bring back, for example, all the users that work in a specific project with queries like this:

{
    "query":{
        "match": {
            "projects.name": "project_1"
        }
    }
}
  1. Yes. Like the query above, you can match all the projects by their "category" field. However, keep in mind that since your base document is merely related to users, it will bring back the whole user's document.

For that case, you might want to use the Terms aggregation, which will bring you the unique values of certain fields. This can be "combined" with a query. Like this:

{
    "query":{
        "match": {
            "projects.category": "Mobile Devices"
        }
    }
},
"size", 0 #Set this to 0 since you want to focus on the aggregation's result.
{
    "aggs" : {
        "unique_projects_names" : {
            "terms" : { "field" : "projects.name" } 
        }
    }
}

That last query will bring back, in the aggregation fields, all the unique projects' name with the category "Mobile Devices".

  1. You can create a new index where you'll store all the information related to your projects. However, the relationships betwen users and projects won't be easy to keep (remember that ES is NOT intended for being an structured or ER DB, like SQL) and the queries will become very complex, even if you decide to name both of your indices (users and projects) in a way you can call them with a wildcard.

EDIT: Additional, you can consider store all the info related to your projects in Postgress and do the call separately, first get the project ID (or name) from ES and then the project's info from Postgres (since I assume is maybe the info that is more likely not to change).

Hope this is helpful! :D

Upvotes: 0

Related Questions