- We don’t have novelty for the runtime part
- We likely need to bet on the programming abstraction
We have noir, the platform is ready and proven fast.
Multi stream dataflow abstraction.
What is a distributed application?
- Can we define a distributed application as a set of dataflow streams each implementing a functionality being activated by a request, and triggering a response
- Can we do it with existing dataflow? Not really, because all the streams must be independent. All workers cannot share state because sharing state is dangerous due to how they are implemented
- Why are we different? We can have safe shared local state.
- We can have an application defined as a set of dataflow interacting with some local state
- Decompose the application in a set of streams, the streams can trigger other streams, the streams can interact with local shared state using safe primitives (our contribution)
- How can we generalize this? This could be seen as a noir extension, but noir atm does not have everything we need to do this.
- Assigned scheduling of operators (replicas to specific copy, declarative definition)
- True iterations (the iterations we have right now are a bit of a trick as they can only define full iterations over the dataset, we need the possibility to feedback items as we wish until convergence like timely)
- Request response, we need a reliable and well defined way to bring back a result from a streaming computation
- Generalize languages
- WASM functions where we have closures
- Using the boxed dyn api (pandas thesis) + declarative or programmatic interface we can define the structure and add the function with wasm (macro to compile directly maybe)
- Dynamic reconfiguration, this may be a bit tricky, but ideally the graph structure should be dynamically modifiable as to permit
- We also need a resolver for sure to handle the messaging correctly
- Request response through bidirectional streaming
- The request propagates information forward until the end of the computation (sink?) the sink can then initiate the inverse procedure by propagating back the result along the (same? how?) path to get the response to there it originated.
- This generalizes the dataflow to a structured actor network
- Allows recursive computation patterns when combined with iteration to convergence
- This could use the actor pattern of sending a request together with the return address
Key insight
We have actor models and dataflow frameworks. One of the tenets of these systems is no shared state. The communications only happens through messages to avoid synchronization. We have a hybrid system, which has a synchronization through communication core, but that can use safe primitives for cooperation on the same host. If we embed state management in the system, and we partition the tasks such that conflicting interactions only happen through the communication synchronization (eg lock manager actor/stream) we can exploit the shared state for non-overlapping interactions or interactions with relaxed consistency requirements. We can generalize distributed application as a structured network of (~serverless) functions organized as multiple graphs of operators.