Javier Ramirez
Javier Ramirez

Reputation: 4032

Extract multiple parts of varchar in QuestDB

We have a table with a column 'streamId' that itself is composed of several parts. I'm trying to explode the parts into separate columns in a query on the fly but not getting very far.

This is an extract of my values:

"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_averageFlow"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_averageLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_event"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_latency95"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_latency99"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_maxMovingAverageLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_movingAverageFlow"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_movingAverageLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-0_sdLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_averageFlow"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_averageLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_event"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_latency95"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_latency99"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_maxMovingAverageLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_movingAverageFlow"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_movingAverageLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-1_sdLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-2_averageFlow"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-2_averageLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-2_event"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-2_latency95"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-2_latency99"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-2_maxMovingAverageLatency"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-2_movingAverageFlow"
"latencyPerNodeByMachineAggregator:SOURCE_X/PROCESS_Y->DESTINATION_Z/strategy_A->fcgw-2_sdLatency"

The components are <sourceId>:<nodePath>_<columnId> That _ is the LAST _ character in the string - nodePath can have _ inside. Also, it is valid to have only <sourceId>:_<columnId> without a nodePath when I have a root node.

It seems QuestDB does not support \1 or similar for substitutions and I need to regexp_replace each part individually, but so far no luck. This is what I have, which is close but it is not capturing the columnId.

  select streamId,regexp_replace(streamId, ':(.+)', '') as sourceId, 
  regexp_replace(streamId, '^(.+:)$', '') as nodepathParts, 
  regexp_replace(streamId, '^(.*)_.*?$', '') as columnId from table;

Upvotes: 0

Views: 8

Answers (1)

Javier Ramirez
Javier Ramirez

Reputation: 4032

For the columnId in QuestDB we can use regexp_replace(streamId, '^.*_', ''), which will replace anything before the very last _ by empty, so the result will be the columnId.

Similarly, we can use this other expression '_([^_]+$)' to replace anything after the last underscore until the end of the string (the columnId) with empty, leaving only the whole string except the last part, and then we can regexp_replace with this '.+:' to remove the beginning of the string until the colon, leaving as result the NodePath (if any).

And we can use this regexp ':(.+)' , which could also have been a split_part(), to keep only the first segment, the sourceId.

Altogether, it would be this:

with s as (
 SELECT 'PROCESS_X-agentPerMachineAggregator:STAGE_A/strategy_A/27320_averageFlow' AS streamId
UNION ALL SELECT 'PROCESS_X-agentPerMachineAggregator:STAGE_A_movingAverageFlow'
UNION ALL SELECT 'PROCESS_X-agentPerMachineAggregator:_averageFlow'
UNION ALL SELECT 'PROCESS_Y-agentPerMachineAggregator:STAGE_B/strategy_A/27320_averageFlow'
UNION ALL SELECT 'PROCESS_Y-latencyPerNodeByMachineAggregator:STAGE_B->STAGE_A/strategy_A->strategy_A_averageFlow'
)
select streamId, 
      regexp_replace(streamId, ':.*', '') as sourceId, 
      regexp_replace(regexp_replace(streamId, '_([^_]+$)', ''),'.+:','') as nodePath, 
      regexp_replace(streamId, '^.*_', '') as columnId
 from s;

Upvotes: 0

Related Questions