I have two streams. The first is a count-based window stream: I used countWindowAll
to collect the first 10 data points and compute a stat value from them. I manually used the variable cnt
to keep only the first window and filter out the results of all later windows, as shown in the code below.
I then want to use this value to filter the main stream, keeping only the values that are greater than the stat value computed in the window stream.
However, I have no idea how to merge or combine these two streams to achieve this.
My idea is to turn the first stat value into a broadcast variable and pass it to the main stream, so that incoming values can be filtered against the stat value held in the broadcast variable.
Below is my code.
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.functions.windowing.*;
import org.apache.flink.util.Collector;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.concurrent.TimeUnit;
public class ReadFromKafka {
static int cnt = 0;
public static void main(String[] args) throws Exception{
// create execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "flink");
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer09<>("flinkStreaming11", new SimpleStringSchema(), properties));
env.enableCheckpointing(1000);
// Count-based window stream: each window holds 10 elements
DataStream<String> process = stream.countWindowAll(10).process(new ProcessAllWindowFunction<String, Tuple2<Double, Integer>, GlobalWindow>() {
@Override
public void process(Context context, Iterable<String> iterable, Collector<Tuple2<Double, Integer>> collector) throws Exception {
Double sum = 0.0;
int n = 0;
List<Double> listDouble = new ArrayList<>();
for (String in : iterable) {
n++;
double d = Double.parseDouble(in);
sum += d;
listDouble.add(d);
}
cnt++;
Double[] sd = listDouble.toArray(new Double[listDouble.size()]);
double mean = sum / n;
double sdev = 0;
for (int i = 0; i < sd.length; ++i) {
sdev += ((sd[i] - mean) * (sd[i] - mean)) / (sd.length - 1);
}
double standardDeviation = Math.sqrt(sdev);
collector.collect(new Tuple2<Double, Integer>(mean + 3 * standardDeviation, cnt));
}
}).filter(new FilterFunction<Tuple2<Double, Integer>>() {
@Override
public boolean filter(Tuple2<Double, Integer> doubleIntegerTuple2) throws Exception {
Integer i1 = doubleIntegerTuple2.f1;
if (i1 > 1)
return false;
else
return true;
}
}).map(new RichMapFunction<Tuple2<Double, Integer>, String>() {
@Override
public String map(Tuple2<Double, Integer> doubleIntegerTuple2) throws Exception {
return String.valueOf(doubleIntegerTuple2.f0);
}
});
// I don't think this is a proper solution.
process.union(stream).filter(new FilterFunction<String>() {
@Override
public boolean filter(String s) throws Exception {
return false;
}
});
env.execute("InfluxDB Sink Example");
}
}
Upvotes: 0
Views: 1005
Reputation: 9245
First, I think you only have one stream, right? There's only one Kafka-based source of doubles (encoded as Strings).
Second, if the first 10 values really do permanently define the limit for filtering, then you can just run the stream into a RichFlatMapFunction, where you capture the first 10 values to calculate your limit (mean plus three standard deviations in your code), and then filter all subsequent values (only output values >= this limit).
Note that typically you'd want to save state (array of 10 initial values, plus the limit) so that your workflow can be restarted from a checkpoint/savepoint.
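To make the idea concrete, here is a minimal sketch of just the filtering logic in plain Java, with no Flink dependency. The class name FirstWindowLimitFilter and its methods are hypothetical; the intent is that a RichFlatMapFunction's flatMap would call accept(...) for each parsed value and emit the value only when it returns true. It assumes the limit is the mean plus three sample standard deviations, as in the question, and that the warm-up values themselves are not emitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: learns a limit from the first N values, then filters by it. */
final class FirstWindowLimitFilter {
    private final int warmupSize;
    // These two fields are what you would keep in checkpointed state in a real job.
    private final List<Double> warmup = new ArrayList<>();
    private Double limit = null; // null until the warm-up phase is complete

    FirstWindowLimitFilter(int warmupSize) {
        if (warmupSize < 2) {
            throw new IllegalArgumentException("need at least 2 values for a sample standard deviation");
        }
        this.warmupSize = warmupSize;
    }

    /** Returns true if the value should be emitted downstream. */
    boolean accept(double value) {
        if (limit == null) {
            warmup.add(value);
            if (warmup.size() == warmupSize) {
                limit = meanPlusThreeSigma(warmup);
            }
            return false; // assumption: warm-up values are consumed, not emitted
        }
        return value >= limit;
    }

    /** The learned limit, or null while still warming up. */
    Double getLimit() {
        return limit;
    }

    /** Mean plus three sample standard deviations, matching the question's formula. */
    static double meanPlusThreeSigma(Iterable<Double> values) {
        double sum = 0.0;
        int n = 0;
        for (double v : values) {
            sum += v;
            n++;
        }
        double mean = sum / n;
        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        return mean + 3 * Math.sqrt(squaredDiffs / (n - 1));
    }
}
```

Inside the job you would construct it with 10 and parse each incoming String with Double.parseDouble before calling accept. Keeping the logic in a small class like this also makes it easy to unit test without starting a Flink cluster.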
If instead you are constantly re-calculating your limit from the most recent 10 values, then the code is just a bit more complex, in that you have a queue of values, and you need to do the filtering on the value being flushed from the queue when you add a new value.
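The rolling variant could look roughly like the sketch below, again in plain Java so it can be called from a flatMap. RollingLimitFilter and offer(...) are hypothetical names, and the sketch rests on one reading of the description above: when a new value pushes the queue past its capacity, the oldest value is evicted and tested against a limit computed from the N values that remain (the most recent ones).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.OptionalDouble;

/** Hypothetical helper: filters each evicted value against a limit from the latest N values. */
final class RollingLimitFilter {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();

    RollingLimitFilter(int windowSize) {
        if (windowSize < 2) {
            throw new IllegalArgumentException("need at least 2 values for a sample standard deviation");
        }
        this.windowSize = windowSize;
    }

    /**
     * Adds a value. Once the queue overflows, the oldest value is evicted and
     * returned if it is >= the limit of the remaining (most recent) values.
     */
    OptionalDouble offer(double value) {
        window.addLast(value);
        if (window.size() <= windowSize) {
            return OptionalDouble.empty(); // still filling the queue
        }
        double evicted = window.removeFirst();
        double limit = meanPlusThreeSigma(window);
        return evicted >= limit ? OptionalDouble.of(evicted) : OptionalDouble.empty();
    }

    /** Mean plus three sample standard deviations, matching the question's formula. */
    private static double meanPlusThreeSigma(Iterable<Double> values) {
        double sum = 0.0;
        int n = 0;
        for (double v : values) {
            sum += v;
            n++;
        }
        double mean = sum / n;
        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        return mean + 3 * Math.sqrt(squaredDiffs / (n - 1));
    }
}
```

Note that with this design every output lags the input by N elements, and the values still in the queue when the stream ends are never emitted. Recomputing the statistics on every element is fine for a queue of 10; for much larger windows you would probably maintain running sums instead.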
Upvotes: 1