Tech Talk: Machine Learning at Scale Using Distributed Stream Processing

This video shows one approach that lets you write a low-latency, auto-parallelized, distributed stream processing pipeline in Java that integrates seamlessly with a data scientist’s work, taken almost unchanged from their Python development environment.

About this course

The capabilities of machine learning are now pretty well understood, and there are great tools for doing data science and constructing models that answer nontrivial questions about your data. These tools are mostly used from Python. The key new challenge is making the trained prediction model usable in real time, while the user is interacting with your software. Getting answers from an ML model (this is called inference) takes a lot of CPU and must be done at serious scale. The ML tools are optimized mainly for batch-processing a lot of data at once, and often the implementations aren’t parallelized. In this talk, I will show one approach that lets you write a low-latency, auto-parallelized, distributed stream processing pipeline in Java that integrates seamlessly with a data scientist’s work, taken almost unchanged from their Python development environment. The talk includes a live demo using the command line and a walk through some Python and Java code snippets.

Presented by Marko Topolnik

Marko Topolnik is a senior engineer in the Jet Core team. He has been with Hazelcast® since 2015, holds a Ph.D. in computer science, and has a six-figure score on Stack Overflow.
