I am going through the discussion on the best way to design our API (Stream vs. Collection as return type). The discussion in that post is very valuable. @BrianGoetz's answer mentions one condition where collections are better than streams. I couldn't quite understand what it means; can someone please help with an example or explanation? "when there are strong consistency requirements, and you have to produce a consistent snapshot of a moving target." My question is, specifically, what do "strong consistency requirements" and "consistent snapshot of a moving target" mean in real-world applications?

Stream vs Collection as return type

Answers (5)

Reputation: 132350

In this context, the notion of "strong consistency requirement" is relative to the system or application within which the code resides. There's no specific notion of "strong consistency" that's independent of the system or application. Here's an example of "consistency" that is determined by what assertions you can make about a result. It should be clear that the semantics of these assertions are entirely application-specific.

Suppose you have some code that implements a room where people can enter and leave. You might want the relevant methods to be synchronized so that all enter and leave actions occur in some order. For example: (using Java 16)

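import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

record Person(String name) { }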

public class Room {
    final Set<Person> occupants = Collections.newSetFromMap(new ConcurrentHashMap<>());

    public synchronized void enter(Person p) { occupants.add(p); }
    public synchronized void leave(Person p) { occupants.remove(p); }
    public Stream<Person> occupants() { return occupants.stream(); }
}

(Note, I'm using ConcurrentHashMap here because it doesn't throw ConcurrentModificationException if it's modified during iteration.)

Next, suppose some threads execute these methods in this order:

room.enter(new Person("Brett"));
room.enter(new Person("Chris"));
room.enter(new Person("Dana"));
room.leave(new Person("Dana"));
room.enter(new Person("Ashley"));

Now, at around the same time, suppose a caller gets a list of persons in the room by doing this:

List<Person> occupants1 = room.occupants().toList();

The result might be:

[Dana, Brett, Chris, Ashley]

How is this possible? The stream is lazily evaluated, and the elements are being pulled into a List at the same time other threads are modifying the source of the stream. In particular, it's possible for the stream to have "seen" Dana, then Dana is removed and Ashley added, and then the stream advances and encounters Ashley.

What does the stream represent, then? To find out, we have to dig into what ConcurrentHashMap says about its streams in the presence of concurrent modification. The set is built from CHM's keySet view, which says "The view's iterators and spliterators are weakly consistent." The definition of weakly consistent is in turn:

Most concurrent Collection implementations (including most Queues) also differ from the usual java.util conventions in that their Iterators and Spliterators provide weakly consistent rather than fast-fail traversal:

they may proceed concurrently with other operations

they will never throw ConcurrentModificationException

they are guaranteed to traverse elements as they existed upon construction exactly once, and may (but are not guaranteed to) reflect any modifications subsequent to construction.

What does this mean for our Room application? I'd say it means that if a person appears in the stream of occupants, that person was in the room at some point. That's a pretty weak statement. Note in particular that it does not allow you to say that Dana and Ashley were in the room at the same time. It might seem that way from the contents of the List, but that would be incorrect: in the sequence of calls above, Dana left before Ashley entered.
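The weakly consistent behavior can be reproduced on a single thread by interleaving the modifications with a manual traversal. This is only a sketch: the class and method names are made up for illustration, and an iterator is used instead of a stream so that the interleaving is explicit. Whether Ashley shows up in the result is deliberately left open, because the specification allows either outcome.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class, not part of the Room example.
class WeaklyConsistentDemo {
    record Person(String name) { }

    // Starts a traversal, then changes the set before the traversal finishes.
    static List<Person> traverseWhileModifying(Set<Person> occupants) {
        List<Person> seen = new ArrayList<>();
        Iterator<Person> it = occupants.iterator();

        Person first = it.next();            // the traversal has "seen" one person
        seen.add(first);

        occupants.remove(first);             // that person leaves ...
        occupants.add(new Person("Ashley")); // ... and Ashley enters afterwards

        it.forEachRemaining(seen::add);      // no ConcurrentModificationException
        return seen;
    }

    // Builds the same kind of set as the Room class, with three occupants.
    static Set<Person> newRoom() {
        Set<Person> occupants = Collections.newSetFromMap(new ConcurrentHashMap<>());
        occupants.add(new Person("Brett"));
        occupants.add(new Person("Chris"));
        occupants.add(new Person("Dana"));
        return occupants;
    }

    public static void main(String[] args) {
        Set<Person> occupants = newRoom();
        List<Person> seen = traverseWhileModifying(occupants);
        System.out.println("traversal saw: " + seen);
        System.out.println("room now:      " + occupants);
    }
}
```

The returned list contains a person who is no longer in the set, so it does not describe the room at any single moment.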

Now suppose we were to change the Room class to return a List instead of a Stream, and the caller were to use that instead:

// in class Room
public synchronized List<Person> occupants() { return List.copyOf(occupants); }

// in the caller
List<Person> occupants2 = room.occupants();

The result might be:

[Dana, Brett, Chris]

You can make much stronger statements about this List than about the previous one. You can say that Chris and Dana were in the room at the same time, and that at this particular point in time, Ashley was not in the room.

The List version of occupants() gives you a snapshot of the occupants of the room at a particular time. This allows you much stronger statements than the stream version, which only tells you that certain persons were in the room at some point.
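To make "stronger statements" concrete, here is a sketch of an invariant that only a locked snapshot can guarantee. SwapRoom and SnapshotDemo are hypothetical names invented for this illustration: the room always holds exactly one person, because one person leaving and another entering happen as a single synchronized step.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical room that always holds exactly one person.
class SwapRoom {
    private final Set<String> occupants = ConcurrentHashMap.newKeySet();

    SwapRoom(String first) { occupants.add(first); }

    // One person leaves and another enters as a single atomic step.
    synchronized void swap(String leaving, String entering) {
        occupants.remove(leaving);
        occupants.add(entering);
    }

    // Copy made under the same lock as swap(), so it never observes a half-finished swap.
    synchronized List<String> snapshot() { return List.copyOf(occupants); }
}

class SnapshotDemo {
    // Returns true if every snapshot taken during concurrent swaps had exactly one occupant.
    static boolean allSnapshotsConsistent(int rounds) {
        SwapRoom room = new SwapRoom("Alice");
        Thread writer = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                room.swap("Alice", "Bob");
                room.swap("Bob", "Alice");
            }
        });
        writer.start();

        boolean consistent = true;
        for (int i = 0; i < rounds; i++) {
            consistent &= room.snapshot().size() == 1;
        }

        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consistent;
    }

    public static void main(String[] args) {
        System.out.println(allSnapshotsConsistent(100_000));
    }
}
```

Because snapshot() and swap() share one lock, the check holds by construction. An unsynchronized occupants.stream() offers no such guarantee: it could observe zero, one, or two names.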

Why would you ever want an API with weaker semantics? Again, it depends on the application. If you want to send a survey to people who used the room, all you care about is whether they were ever in the room. You don't care about other things, like who else was in the room at the same time.

The API with stronger semantics is potentially more expensive. It needs to make a copy of the collection, which means allocating space and spending time copying. It needs to hold a lock while it does this, to prevent concurrent modification, and this temporarily blocks other updates from proceeding.

To summarize, the notion of "strong" or "weak" consistency is highly dependent on the context. In this case I made up an example with some associated semantics, such as "in the room at the same time" or "was in the room at some point in time." The semantics required by the application determine the strength or weakness of the consistency of the results. This in turn drives what Java mechanisms should be used, such as streams vs. collections and when to apply locks.

Upvotes: 4

Tomes

Reputation: 185

"when there are strong consistency requirements, and you have to produce a consistent snapshot of a moving target."

What the author @Brian Goetz was referring to is the point in time when the stream gets consumed.

Here lies the first misunderstanding of the java.util.stream API.

When you return a stream, the caller gets a handle on an object that has not started pulling elements yet.

The collection is iterated only when a terminal operation is invoked. Until that point, the collection and its items can change. And this is the only lazy part about a stream. Otherwise you probably want to ride the bull of RxJava2.. ;- )
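A small sketch of this laziness, assuming an ArrayList as the source (its spliterator is late-binding); the class name is made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LateBindingDemo {
    static String joinAfterModification() {
        List<String> source = new ArrayList<>(List.of("one", "two"));
        Stream<String> stream = source.stream();        // nothing is traversed yet
        source.add("three");                            // source changes before the terminal operation
        return stream.collect(Collectors.joining(" ")); // traversal starts here
    }

    public static void main(String[] args) {
        System.out.println(joinAfterModification()); // prints "one two three"
    }
}
```

The element added after the stream was created still ends up in the result, because the source is only bound when the terminal operation starts.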

// EDIT FOR THAT BOUNTY:

A real-world example would be: at this exact moment, what is the price of these specific shares?

Then you want to pass immutable objects, which one can use to place an order after inspecting them.

If the price changes in the meantime, but the object is required to place an order, you do not care how long your user takes to place it. The price was fixed beforehand.

// EDIT END.
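The share-price example could be sketched as follows. Quote, Ticker, and the symbol "ACME" are hypothetical names used only for illustration; the point is that the returned object is immutable and keeps the price it was created with.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Immutable value: once created, a Quote never changes.
record Quote(String symbol, BigDecimal price, Instant asOf) { }

// Hypothetical price source whose prices keep moving.
class Ticker {
    private final Map<String, BigDecimal> prices = new ConcurrentHashMap<>();

    void update(String symbol, BigDecimal price) { prices.put(symbol, price); }

    // Captures the price at the moment of the call.
    Quote quote(String symbol) {
        return new Quote(symbol, prices.get(symbol), Instant.now());
    }
}

class QuoteDemo {
    static Quote quoteThenMove() {
        Ticker ticker = new Ticker();
        ticker.update("ACME", new BigDecimal("10.00"));
        Quote quote = ticker.quote("ACME");             // price is fixed here
        ticker.update("ACME", new BigDecimal("10.25")); // the market moves on
        return quote;                                   // still carries 10.00
    }
}
```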

Anyhow, the same can happen to a collection until you start iterating. Both these cases are related to concurrent access.

Also, this isn't an iteration of the items per se.
Each object is passed through the chain.

Therefore you have to approach the entire question differently, imho.

Should the collection be mutable or immutable?
Are you passing immutable objects? (If not, you need to consider the following question:)
Do you pass the references to the objects, so they can get altered or is a deep-copy required?

So after these questions are answered, let's talk about a disadvantage of streams: O(n) access. Suppose the user wants to access an object at a given index. Either he first has to iterate over all objects to add them to a new data structure, or he has to iterate in order until the item is reached. The latter visits every element only in the worst case, but the former, a new data structure, doubles the heap-memory allocation. This will also affect garbage collection afterwards.
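A minimal sketch of the two access styles, with a hypothetical helper elementAt; the stream variant has to skip past the preceding elements, while the list variant indexes directly:

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

class IndexAccessDemo {
    // Stream: walk past the first `index` elements, then take the next one.
    static <T> Optional<T> elementAt(Stream<T> stream, long index) {
        return stream.skip(index).findFirst();
    }

    public static void main(String[] args) {
        List<String> names = List.of("Brett", "Chris", "Dana");

        // Collection: direct indexed access.
        System.out.println(names.get(2));

        // Stream: same element, but reached by skipping.
        System.out.println(elementAt(names.stream(), 2).orElseThrow());
    }
}
```

Note also that the stream is used up after one lookup; a second lookup needs a new stream from the source.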

But why are streams so darn cute?

Because you can write code which is just more readable. That's it! When all the client does is consume the items, it is good advice for him to use streams. This way his code base is more readable.
There is this big elephant in the room - concurrency. When used appropriately, it is cheap (in terms of development time) to introduce mature multi-threading.
Streams implement the AutoCloseable interface, which is nice.

Elaborating on the third point: When you need to close a resource after consuming, it is always necessary to do this on your own. Therefore a Visitor pattern is the more applicable option, and within it the user can choose on his own whether he wants to use a stream or a collection. :- )
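A sketch of what "on your own" means for the caller. The openLines method is a hypothetical stand-in for a stream backed by a real resource; the close handler runs only when the caller closes the stream, here via try-with-resources, not when the terminal operation finishes.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ClosingStreamDemo {
    // Stands in for a stream backed by a resource (file, cursor, connection).
    static Stream<String> openLines(AtomicBoolean closed) {
        return Stream.of("a", "b", "c").onClose(() -> closed.set(true));
    }

    static boolean closedAfterUse() {
        AtomicBoolean closed = new AtomicBoolean(false);
        try (Stream<String> lines = openLines(closed)) {
            List<String> copy = lines.collect(Collectors.toList());
            // The terminal operation has finished, but the close handler has not run yet.
            if (closed.get() || copy.size() != 3) {
                throw new IllegalStateException("unexpected state before close");
            }
        }
        // The handler runs when the try block exits.
        return closed.get();
    }
}
```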

Imo, you should always stick to collections for an API. This way you are not requiring familiarity with the stream API. Anybody who wants to use streams can do so on their own.

// EDIT 2: Elaborate on the confusion of streams - OPINIONATED

This "strong consistency requirements" seems related to more of design requirement. I would be happy to provide the bounty if the answer has details with authoritative references.

It is not about streams vs. collections. It is about the point-in-time one consumes the collection (both are collections anyway). If your user only wants to get the current state of objects, you return a collection. If your user wants to subscribe to new items, he would register an Observable at your api.
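The two kinds of API can be contrasted in a small sketch. ItemFeed is a hand-rolled, hypothetical class written for this illustration; it is not RxJava's Observable, only the same basic idea of a callback invoked for later items.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical API offering both "current state" and "subscribe to new items".
class ItemFeed<T> {
    private final List<T> items = new ArrayList<>();
    private final List<Consumer<T>> subscribers = new CopyOnWriteArrayList<>();

    // Current state: a copy of what exists right now.
    synchronized List<T> current() { return List.copyOf(items); }

    // Subscription: the callback is invoked for every item published later.
    void subscribe(Consumer<T> callback) { subscribers.add(callback); }

    void publish(T item) {
        synchronized (this) { items.add(item); }
        subscribers.forEach(s -> s.accept(item));
    }
}

class ItemFeedDemo {
    static List<List<String>> run() {
        ItemFeed<String> feed = new ItemFeed<>();
        feed.publish("first");

        List<String> snapshot = feed.current();   // contains only "first", fixed from now on
        List<String> received = new ArrayList<>();
        feed.subscribe(received::add);            // sees only later items

        feed.publish("second");
        return List.of(snapshot, received);       // [[first], [second]]
    }
}
```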

This is, imo, where the confusion about streams is rooted. There are the libraries from https://reactiveX.io which provide a stream-like interface for subscribing to a data source.

This picture shows the time-line of one of their classes. What is happening is quite simple: The caller registers transformation methods and callbacks, which are invoked once you start to emit items. This is exactly the old principle of an observer callback. I would strongly advise against using Observables, for various reasons.

All colleagues have to be familiar with them
Debugging will get harder, since the callstacks are way more verbose.
One can easily end up in callback-hell.
Their application is highly specialized, so use them rarely. They are a good fit if you are emitting the same items for every user continuously. If you are doing normal CRUD operations, don't introduce Observables.

They are fun, though. :- )

Upvotes: 8

Harun Cetin

Reputation: 109

I will try to explain your sentence as briefly as possible (as far as I know). The first term I would like to explain is "strong consistency requirements". Banking applications, real-time network traffic analysis, etc. are mission-critical applications, and consistency (in every sense) is the first requirement there, because any misplaced or missing piece of data may cause a data integrity problem, that is, inconsistent data. This is a very big problem for these applications.

The second term is "you have to produce a consistent snapshot of a moving target". Here the "moving target" is our streaming data. If you do machine learning on (real-time) streaming data, you have to sample the data in order to process it with a machine learning (or deep learning) algorithm. To do that, you pick samples from the data at certain time intervals (time frames), process that chunk of data, and then move on to the next one. This is called batch (or bulk) processing. In this context, the "snapshot" is the sample. You should therefore pick each data sample from the stream at well-defined time intervals and somehow ensure the integrity of the data within the sample (batch).

Upvotes: 1

Satyam Singh

Reputation: 176

So basically, when you return a collection, you are returning a snapshot of the players object at that particular moment, that is, a copy of the players object at the time the "getPlayersAsCollection" method is called. Any later change to the players list by other threads will not be reflected in the collection returned earlier. This is how consistency is maintained: at the time of calling getPlayersAsCollection, you get exactly what is present in the players list, even though that list is constantly being modified by adding or removing player details. And that explains "consistent snapshot of a moving target".

class Team {
    private List<Player> players = new ArrayList<>();

    // ...

    public List<Player> getPlayersAsCollection() {
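        // Copy the list so the caller gets a snapshot. Collections.unmodifiableList(players)
        // would only be a read-only view that still reflects later changes to players.
        return List.copyOf(players);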
    }

    public Stream<Player> getPlayersAsStream() {
        return players.stream();
    }
}

Whereas, when a stream is returned here, it is as if a pointer to the players list were returned. Any change made to the players object between the moment the Stream is returned by the "getPlayersAsStream" method and the moment you perform stream operations on it will be reflected in the result. So there is "no strong consistency" in this case, because the data can change between the time getPlayersAsStream returns and the time you access the returned Stream.

But again, returning Stream has its own advantages as it was explained in the link shared in the question. It depends on the particular use case whether to return Stream or Collection.
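Whether a returned collection really is a snapshot depends on how it is created. A minimal sketch, with made-up names and plain strings instead of Player objects, comparing a read-only wrapper with a copy:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ViewVersusCopyDemo {
    static int[] sizesAfterLaterAdd() {
        List<String> players = new ArrayList<>(List.of("Ann", "Ben"));

        List<String> view = Collections.unmodifiableList(players); // read-only view of players
        List<String> copy = List.copyOf(players);                  // independent snapshot

        players.add("Cid"); // the "moving target" changes after both were handed out

        return new int[] { view.size(), copy.size() }; // {3, 2}
    }
}
```

Only the copy stays at two elements. The unmodifiable wrapper prevents the caller from modifying the list, but it still shows every later change made to the backing list.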

I hope this helps and clarifies your doubt on "when there are strong consistency requirements, and you have to produce a consistent snapshot of a moving target."

Upvotes: 2

Akshay Gehi

Reputation: 362

Think of strong consistency as a point-in-time snapshot of a source that is ever changing. Suppose you are an e-commerce giant and want to see the sales during a month. You could return from the database a snapshot of all the sale records between 1st December and 31st December. This is a finite collection (e.g. a List), even though it could be rather large for some companies. It is a consistent snapshot Collection because the existing sales could change over time due to cancellations or returns; the API simply provides a point-in-time snapshot of how the sales looked when the List was created. In another use case for the same company, suppose the data science team has an application which constantly monitors the sale transactions as they happen (a moving target) in order to detect fraud. This is a continuous stream of data with no finite boundaries, yet every transaction from the stream is picked up and analyzed.

Upvotes: 1
