Reputation: 63
I have this json schema
{
"name":"Pete"
"age":24,
"subjects":[
{
"name":"maths"
"grade":"A"
},
{
"name":"maths"
"grade":"B"
}
]
}
and I want to ingest this into a pinot table to run a query like
select age,subjects_grade,count(*) from table group by age,subjects_grade
Is there a way to do this in a pinot job?
Upvotes: 4
Views: 1867
Reputation: 131
Pinot has two ways to handle JSON records:
1. Flatten the record during ingestion time: In this case, we treat each nested field as a separated field, so need to:
Please see how column subjects_name
and subjects_grade
is defined below. Since it's an array, so both fields are multi-value columns in Pinot.
2. Directly ingest JSON records
In this case, we treat each nested field as one single field, so need to:
jsonFormat
to stringify the JSON field in table configPlease see how column subjects_str
is defined below.
Below is the sample table schema/config/query:
Sample Pinot Schema:
{
"metricFieldSpecs": [],
"dimensionFieldSpecs": [
{
"dataType": "STRING",
"name": "name"
},
{
"dataType": "LONG",
"name": "age"
},
{
"dataType": "STRING",
"name": "subjects_str"
},
{
"dataType": "STRING",
"name": "subjects_name",
"singleValueField": false
},
{
"dataType": "STRING",
"name": "subjects_grade",
"singleValueField": false
}
],
"dateTimeFieldSpecs": [],
"schemaName": "myTable"
}
Sample Table Config:
{
"tableName": "myTable",
"tableType": "OFFLINE",
"segmentsConfig": {
"segmentPushType": "APPEND",
"segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
"schemaName": "myTable",
"replication": "1"
},
"tenants": {},
"tableIndexConfig": {
"loadMode": "MMAP",
"invertedIndexColumns": [],
"noDictionaryColumns": [
"subjects_str"
],
"jsonIndexColumns": [
"subjects_str"
]
},
"metadata": {
"customConfigs": {}
},
"ingestionConfig": {
"batchIngestionConfig": {
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "DAILY",
"batchConfigMaps": [],
"segmentNameSpec": {},
"pushSpec": {}
},
"transformConfigs": [
{
"columnName": "subjects_str",
"transformFunction": "jsonFormat(subjects)"
},
{
"columnName": "subjects_name",
"transformFunction": "jsonPathArray(subjects, '$.[*].name')"
},
{
"columnName": "subjects_grade",
"transformFunction": "jsonPathArray(subjects, '$.[*].grade')"
}
]
}
}
Sample Query:
select age, subjects_grade, count(*) from myTable GROUP BY age, subjects_grade
select age, json_extract_scalar(subjects_str, '$.[*].grade', 'STRING') as subjects_grade, count(*) from myTable GROUP BY age, subjects_grade
Comparing both ways, we recommend solution 1 to flatten the nested fields out when the field density is high(e.g. every document has field name and grade, then it's worth extracting them out to be new columns), it gives better query performance and better storage efficiency.
For solution 2, it's simpler in configuration, and good for sparse fields(e.g. only a few documents have certain fields). It requires to use json_extract_scalar function to access the nested field.
Please also note the behavior of Pinot GROUP BY on multi-value columns.
More references:
Upvotes: 8