Reputation: 8249
Being new to Hazelcast Jet, I was trying to build a setup where a single item from an infinite source (i.e. a Map Journal of user requests) is MapReduced against a (possibly changing and) huge Map of reference items.
Specifically, for this example I want to determine the IDs of the vectors (read: float[]) with the smallest Euclidean distance to a user-defined input vector (the query), taken from a map of vectors (the references).
If implemented naively on a single machine, this would mean going through the Map items of the references and determining the Euclidean distance to the query for each of them, while keeping the k smallest matches. The input is taken from a user request (HTTP POST, button click, etc.) and the result set is available immediately after the computation finishes.
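For illustration, a rough single-machine sketch of that naive variant (the names are mine: referenceVectors is a plain java.util.Map<Long, float[]>, maxResults is the k to keep):
static List<Map.Entry<Long, Float>> nearestNaive(Map<Long, float[]> referenceVectors,
                                                 float[] queryVector, int maxResults) {
    // head of the queue = current worst of the k best matches
    PriorityQueue<Map.Entry<Long, Float>> best =
            new PriorityQueue<>(Map.Entry.<Long, Float>comparingByValue().reversed());
    for (Map.Entry<Long, float[]> e : referenceVectors.entrySet()) {
        float[] ref = e.getValue();
        float distance = 0.0f;
        for (int i = 0; i < queryVector.length; ++i) {
            distance += (queryVector[i] - ref[i]) * (queryVector[i] - ref[i]);
        }
        best.add(new AbstractMap.SimpleEntry<>(e.getKey(), (float) Math.sqrt(distance)));
        if (best.size() > maxResults) {
            best.remove();                         // evict the current worst match
        }
    }
    List<Map.Entry<Long, Float>> result = new ArrayList<>(best);
    result.sort(Map.Entry.comparingByValue());     // closest first
    return result;
}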
My recent approach was to:
- .distributed().broadcast() the request to the mapping job
- have each mapper work on the .localKeySet() of the reference vectors
- use .partitioned(item -> item.requestId) partitioning for the intermediate results
Conceptually, every query here is a batch of size 1 and I'm actually processing batches as they come. However, I have massive trouble letting the mappers and reducers know when a batch is done, so that the collectors know when they are done (and can emit the final result).
I tried using watermarks, both with real and fake timestamps (obtained automatically via an AtomicLong instance), and emitting from the tryProcessWm functions, but that seems to be a very brittle solution, as some of the events are dropped. I also need to make sure that no two requests are interleaved (i.e. by partitioning on the request ID), but at the same time have the mapper run on all nodes ...
How would I attack this task?
Edit #1:
Right now, my mapper looks like this:
private static class EuclideanDistanceMapP extends AbstractProcessor {

    private IMap<Long, float[]> referenceVectors;

    final ScoreComparator comparator = new ScoreComparator();

    @Override
    protected void init(@Nonnull Context context) throws Exception {
        this.referenceVectors = context.jetInstance().getMap(REFERENCE_VECTOR_MAP_NAME);
        super.init(context);
    }

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        final Tuple3<Long, Long, float[]> query = (Tuple3<Long, Long, float[]>) item;
        final long requestId = query.f0();
        final long timestamp = query.f1();
        final float[] queryVector = query.f2();

        final TreeSet<Tuple2<Long, Float>> buffer = new TreeSet<>(comparator);
        for (Long vectorKey : referenceVectors.localKeySet()) {
            float[] referenceVector = referenceVectors.get(vectorKey);
            float distance = 0.0f;
            for (int i = 0; i < queryVector.length; ++i) {
                distance += (queryVector[i] - referenceVector[i]) * (queryVector[i] - referenceVector[i]);
            }
            final Tuple2<Long, Float> score = Tuple2.tuple2(vectorKey, (float) Math.sqrt(distance));
            if (buffer.size() < MAX_RESULTS) {
                buffer.add(score);
                continue;
            }
            // If the value is larger than the largest entry, discard it.
            if (comparator.compare(score, buffer.last()) >= 0) {
                continue;
            }
            // Otherwise we remove the largest entry after adding the new one.
            buffer.add(score);
            buffer.pollLast();
        }

        return tryEmit(Tuple3.tuple3(requestId, timestamp, buffer.toArray()));
    }

    private static class ScoreComparator implements Comparator<Tuple2<Long, Float>> {
        @Override
        public int compare(Tuple2<Long, Float> a, Tuple2<Long, Float> b) {
            return Float.compare(a.f1(), b.f1());
        }
    }
}
The reducer is essentially replicating that (minus the vector calculation, of course).
Edit #2:
Here's the DAG setup. It currently fails when there are more than a handful of concurrent requests. Most of the items are dropped due to the watermarks.
DAG dag = new DAG();
Vertex sourceStream = dag.newVertex("source",
        SourceProcessors.<Long, float[], Tuple2<Long, float[]>>streamMapP(QUERY_VECTOR_MAP_NAME,
                e -> e.getType() == EntryEventType.ADDED || e.getType() == EntryEventType.UPDATED,
                e -> Tuple2.tuple2(e.getKey(), e.getNewValue()), true));
// simple map() using an AtomicLong to create the timestamp
Vertex addTimestamps = dag.newVertex("addTimestamps", AddTimestampMapP::new);
// the class shown above
Vertex map = dag.newVertex("map", EuclideanDistanceMapP::new);
Vertex insertWatermarks = dag.newVertex("insertWatermarks",
        insertWatermarksP((Tuple3<Long, Long, float[]> t) -> t.f1(), withFixedLag(0), emitByMinStep(1)));
Vertex combine = dag.newVertex("combine", CombineP::new);
// simple map() that drops the timestamp
Vertex removeTimestamps = dag.newVertex("removeTimestamps", RemoveTimestampMapP::new);
// Using a list here for testing.
Vertex sink = dag.newVertex("sink", SinkProcessors.writeListP(SINK_NAME));

dag.edge(between(sourceStream, addTimestamps))
   .edge(between(addTimestamps, map.localParallelism(1))
           .broadcast()
           .distributed())
   .edge(between(map, insertWatermarks).isolated())
   .edge(between(insertWatermarks, combine.localParallelism(1))
           .distributed()
           .partitioned((Tuple3<Long, Long, Tuple2<Long, Float>[]> item) -> item.f0()))
   .edge(between(combine, removeTimestamps)
           .partitioned((Tuple3<Long, Long, Tuple2<Long, Float>[]> item) -> item.f0()))
   .edge(between(removeTimestamps, sink.localParallelism(1)));
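AddTimestampMapP and RemoveTimestampMapP are not shown here; they are just the simple mapping processors mentioned in the comments, roughly along these lines:
// Attaches a monotonically increasing pseudo-timestamp from an AtomicLong
// to each (requestId, queryVector) pair coming from the source ...
private static class AddTimestampMapP extends AbstractProcessor {
    private static final AtomicLong TIMESTAMP = new AtomicLong();

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        final Tuple2<Long, float[]> query = (Tuple2<Long, float[]>) item;
        return tryEmit(Tuple3.tuple3(query.f0(), TIMESTAMP.incrementAndGet(), query.f1()));
    }
}

// ... and strips the timestamp again after the combine stage.
private static class RemoveTimestampMapP extends AbstractProcessor {
    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        final Tuple3<Long, Long, Tuple2<Long, Float>[]> result =
                (Tuple3<Long, Long, Tuple2<Long, Float>[]>) item;
        return tryEmit(Tuple2.tuple2(result.f0(), result.f2()));
    }
}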
Edit #3:
Here's my current combiner implementation. I assume that all items will be ordered according to the watermarks, or in general that only items of the same request will be collected by the same combiner instance. This doesn't seem to be true, though ...
private static class CombineP extends AbstractProcessor {

    private final ScoreComparator comparator = new ScoreComparator();
    private final TreeSet<Tuple2<Long, Float>> buffer = new TreeSet<>(comparator);
    private Long requestId;
    private Long timestamp = -1L;

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        final Tuple3<Long, Long, Tuple2<Long, Float>[]> itemTuple = (Tuple3<Long, Long, Tuple2<Long, Float>[]>) item;
        requestId = itemTuple.f0();
        final long currentTimestamp = itemTuple.f1();
        if (currentTimestamp > timestamp) {
            buffer.clear();
        }
        timestamp = currentTimestamp;

        final Object[] scores = itemTuple.f2();
        for (Object scoreObj : scores) {
            final Tuple2<Long, Float> score = (Tuple2<Long, Float>) scoreObj;
            if (buffer.size() < MAX_RESULTS) {
                buffer.add(score);
                continue;
            }
            // If the value is larger than the largest entry, discard it.
            if (comparator.compare(score, buffer.last()) >= 0) {
                continue;
            }
            // Otherwise we remove the largest entry after adding the new one.
            buffer.add(score);
            buffer.pollLast();
        }
        return true;
    }

    @Override
    protected boolean tryProcessWm(int ordinal, @Nonnull Watermark wm) {
        // return super.tryProcessWm(ordinal, wm);
        return tryEmit(Tuple3.tuple3(requestId, timestamp, buffer.toArray())) && super.tryProcessWm(ordinal, wm);
    }

    private static class ScoreComparator implements Comparator<Tuple2<Long, Float>> {
        @Override
        public int compare(Tuple2<Long, Float> a, Tuple2<Long, Float> b) {
            return Float.compare(a.f1(), b.f1());
        }
    }
}
Upvotes: 2
Views: 182
Reputation: 10812
You must always remember that items between two vertices can be reordered. When you have parallel requests, their intermediate results can be interleaved in CombineP.
In CombineP, you can rely on the fact that the number of intermediate results per request is equal to the number of members in the cluster. Calculate the number of participating members in init() from totalParallelism / localParallelism (for example, with 3 members and a local parallelism of 4, the total parallelism is 12 and 12 / 4 = 3). When you have received that many intermediate results for a request, you can emit the final result.
Another trick might be to run multiple requests in parallel on each member. You can achieve this by using two edges:
1. a broadcast + distributed edge to a parallelism=1 processor
2. a unicast edge to a parallelism=N processor
Also note that localKeySet() is not suitable for huge maps: the size of the query result is limited.
Here's the code for the above (written for Jet 0.5).
The DAG:
DAG dag = new DAG();
Vertex sourceStream = dag.newVertex("source",
        streamMapP(QUERY_VECTOR_MAP_NAME,
                e -> e.getType() == EntryEventType.ADDED || e.getType() == EntryEventType.UPDATED,
                e -> entry(e.getKey(), e.getNewValue()), true));
Vertex identity = dag.newVertex("identity", mapP(identity()))
        .localParallelism(1);
Vertex map = dag.newVertex("map", peekOutputP(EuclideanDistanceMapP::new));
Vertex combine = dag.newVertex("combine", peekOutputP(new CombineMetaSupplier()));
Vertex sink = dag.newVertex("sink", writeListP(SINK_NAME));

dag.edge(between(sourceStream, identity)
           .broadcast()
           .distributed())
   .edge(between(identity, map))
   .edge(between(map, combine)
           .distributed()
           .partitioned((Entry item) -> item.getKey()))
   .edge(between(combine, sink));
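A rough usage sketch (not part of the code above; it assumes the map and list names from the question and that the event journal is enabled for the query map):
// Start a Jet member, load reference vectors, submit the DAG and
// issue a query by putting an entry into the journaled query map.
JetInstance jet = Jet.newJetInstance();

IMap<Long, float[]> references = jet.getMap(REFERENCE_VECTOR_MAP_NAME);
references.put(1L, new float[] {0f, 0f});
references.put(2L, new float[] {1f, 1f});

jet.newJob(dag);                                  // submit the streaming job

IMap<Long, float[]> queries = jet.getMap(QUERY_VECTOR_MAP_NAME);
queries.put(42L, new float[] {0.1f, 0.2f});       // requestId -> query vector

// the combined results (requestId -> sorted scores) end up in the sink list
List<Entry<Long, Entry<Long, Float>[]>> results = jet.getList(SINK_NAME);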
EuclideanDistanceMapP class:
private static class EuclideanDistanceMapP extends AbstractProcessor {

    private IMap<Long, float[]> referenceVectors;
    final ScoreComparator comparator = new ScoreComparator();
    private Object pendingItem;

    @Override
    protected void init(@Nonnull Context context) throws Exception {
        this.referenceVectors = context.jetInstance().getMap(REFERENCE_VECTOR_MAP_NAME);
        super.init(context);
    }

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        if (pendingItem == null) {
            final Entry<Long, float[]> query = (Entry<Long, float[]>) item;
            final long requestId = query.getKey();
            final float[] queryVector = query.getValue();
            final PriorityQueue<Entry<Long, Float>> buffer = new PriorityQueue<>(comparator.reversed());
            for (Long vectorKey : referenceVectors.localKeySet()) {
                float[] referenceVector = referenceVectors.get(vectorKey);
                float distance = 0.0f;
                for (int i = 0; i < queryVector.length; ++i) {
                    distance += (queryVector[i] - referenceVector[i]) * (queryVector[i] - referenceVector[i]);
                }
                final Entry<Long, Float> score = entry(vectorKey, (float) Math.sqrt(distance));
                if (buffer.size() < MAX_RESULTS || comparator.compare(score, buffer.peek()) < 0) {
                    if (buffer.size() == MAX_RESULTS)
                        buffer.remove();
                    buffer.add(score);
                }
            }
            pendingItem = entry(requestId, buffer.toArray(new Entry[0]));
        }
        if (tryEmit(pendingItem)) {
            pendingItem = null;
            return true;
        }
        return false;
    }
}
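MAX_RESULTS and ScoreComparator are reused from the question, except that here the comparator works on Map.Entry rather than Tuple2, so presumably something like the sketch below (entry() being Jet's Util.entry() convenience factory):
// Assumed counterpart of the question's ScoreComparator, over Map.Entry:
private static class ScoreComparator implements Comparator<Entry<Long, Float>> {
    @Override
    public int compare(Entry<Long, Float> a, Entry<Long, Float> b) {
        return Float.compare(a.getValue(), b.getValue());
    }
}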
CombineP class:
private static class CombineP extends AbstractProcessor {

    private final ScoreComparator comparator = new ScoreComparator();
    private final Map<Long, PriorityQueue<Entry<Long, Float>>> buffer = new HashMap<>();
    private final Map<Long, Integer> accumulatedCount = new HashMap<>();
    private final int upstreamMemberCount;
    private Entry<Long, Entry<Long, Float>[]> pendingItem;

    private CombineP(int upstreamMemberCount) {
        this.upstreamMemberCount = upstreamMemberCount;
    }

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        if (pendingItem == null) {
            final Entry<Long, Entry<Long, Float>[]> localValue = (Entry<Long, Entry<Long, Float>[]>) item;
            long requestId = localValue.getKey();
            PriorityQueue<Entry<Long, Float>> globalValue =
                    buffer.computeIfAbsent(requestId, key -> new PriorityQueue<>(comparator.reversed()));
            globalValue.addAll(asList(localValue.getValue()));
            while (globalValue.size() > MAX_RESULTS) {
                globalValue.remove();
            }
            int count = accumulatedCount.merge(requestId, 1, Integer::sum);
            if (count == upstreamMemberCount) {
                // we've received enough local values, let's emit and remove the accumulator
                pendingItem = entry(requestId, globalValue.toArray(new Entry[0]));
                Arrays.sort(pendingItem.getValue(), comparator);
                buffer.remove(requestId);
                accumulatedCount.remove(requestId);
            } else {
                return true;
            }
        }
        if (tryEmit(pendingItem)) {
            pendingItem = null;
            return true;
        }
        return false;
    }
}
You also need a custom meta-supplier for CombineP:
private static class CombineMetaSupplier implements ProcessorMetaSupplier {

    private int upstreamMemberCount;

    @Override
    public void init(@Nonnull Context context) {
        upstreamMemberCount = context.totalParallelism() / context.localParallelism();
    }

    @Nonnull
    @Override
    public Function<Address, ProcessorSupplier> get(@Nonnull List<Address> addresses) {
        return address -> ProcessorSupplier.of(() -> new CombineP(upstreamMemberCount));
    }
}
Upvotes: 1