Collecting workload metrics of a distributed system is an important task for improving performance and scalability, and Presto is no exception. Presto is a fast distributed SQL engine mainly developed by Facebook. Recently AWS started using Presto as the backend of Amazon Athena. We are using Presto in our daily analysis too. Just like other distributed systems such as Hadoop, Presto has its own terminology. In order to measure cluster workload correctly, it is necessary to understand these notions and components.
So this post describes the differences between Query, Stage, Task, Driver, Pipeline, and Split, which are used in Presto internally.
First of all, a Presto query can be divided into 3 units.
Query is the query you submitted. A query is divided into several stages, each of which represents a collection of operators. A query is scheduled per stage, and upstream stages must wait for the completion of downstream stages, because upstream stages depend on the data generated downstream.
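The stage dependency above can be sketched as a small tree traversal. This is an illustration only, not Presto's real scheduler API; the stage ids and tree shape are made up.

```python
# Sketch: stages form a tree. An upstream (parent) stage consumes the output of
# its downstream (child) stages, so downstream results become available first.
# The stage ids and plan shape below are hypothetical.

def completion_order(stage):
    """Return stage ids in the order their results become available:
    downstream (leaf) stages first, the root (output) stage last."""
    order = []
    for child in stage.get("children", []):
        order.extend(completion_order(child))
    order.append(stage["id"])
    return order

plan = {"id": 0, "children": [          # Stage 0: final output
    {"id": 1, "children": [             # Stage 1: aggregation
        {"id": 2, "children": []},      # Stage 2: table scan
        {"id": 3, "children": []},      # Stage 3: table scan
    ]},
]}

print(completion_order(plan))  # → [2, 3, 1, 0]
```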
Task is the unit of actual data processing. A stage includes several tasks, so a task is the unit of parallelism across multiple workers. Tasks in the same stage do the same work; only the target data and the worker instance on which a task runs differ. So more tasks in a stage means more parallelism.
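To make "same work, different worker" concrete, here is a toy assignment of a stage's tasks to workers. The worker names and round-robin policy are assumptions for illustration, not Presto's actual task scheduler.

```python
# Sketch: tasks in one stage all run the same operators; only the data they
# read and the worker they land on differ. Worker names are hypothetical.
from itertools import cycle

workers = ["worker-1", "worker-2", "worker-3"]
num_tasks = 6  # more tasks in a stage → more parallelism

# Round-robin the stage's tasks across the workers.
assignment = {f"task-{i}": w for i, w in zip(range(num_tasks), cycle(workers))}
print(assignment)
# → {'task-0': 'worker-1', 'task-1': 'worker-2', ..., 'task-5': 'worker-3'}
```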
This is a live plan diagram generated by Presto. You can check it from the query UI. It shows each stage of the query and the operators run by its tasks (tasks are not shown here explicitly, though). Components like exchange are operators, and these operators do the actual data processing. So this is the basic diagram of a Presto query's internals.
Let’s introduce Driver first. We can regard a driver as a resource for data processing. A large query consumes a lot of drivers, and we can see a query’s progress by checking the number of finished drivers. A driver contains multiple operators, so a task runs several drivers, each of which includes multiple operators.
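Estimating progress from finished drivers can be sketched like this. The field names mirror the driver counters in Presto's query stats JSON, but treat the exact names as an assumption, and the numbers are made-up sample values.

```python
# Sketch: query progress estimated from driver counts. Field names are modeled
# on Presto's query stats (completedDrivers / totalDrivers); the values below
# are hypothetical samples, not real cluster output.

def progress(query_stats):
    """Fraction of the query's drivers that have finished."""
    total = query_stats["totalDrivers"]
    if total == 0:
        return 0.0
    return query_stats["completedDrivers"] / total

stats = {"totalDrivers": 400, "completedDrivers": 300, "runningDrivers": 64}
print(f"{progress(stats):.0%}")  # → 75%
```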
Next is Pipeline, which corresponds to DriverFactory in the implementation. A pipeline keeps the operators and configurations needed to create drivers. All drivers created by a pipeline have the same operator types and the same configurations. In my understanding, some operators are often used together with other operators (e.g. exchange), so creating drivers from a pipeline is a good way to set them up. A pipeline is just a factory that creates drivers, not a runtime component, so we cannot see any metrics about pipelines.
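The factory relationship can be illustrated with a minimal sketch. This is an analogy, not Presto's real DriverFactory class; operator names and the config values are invented for the example.

```python
# Sketch (analogy only): a pipeline holds a fixed list of operator types plus
# configuration, and stamps out drivers that all share them.

class Pipeline:
    def __init__(self, operator_types, config):
        self.operator_types = operator_types
        self.config = config

    def create_driver(self):
        # Every driver gets the same operator types and configuration;
        # only the data it will process differs at runtime.
        return {"operators": list(self.operator_types), "config": self.config}

# Hypothetical scan pipeline: tablescan → filter → exchange.
scan_pipeline = Pipeline(["tablescan", "filter", "exchange"], {"buffer": "32MB"})
d1 = scan_pipeline.create_driver()
d2 = scan_pipeline.create_driver()
assert d1["operators"] == d2["operators"]  # drivers share operator types
```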
What’s a Split? A split is a task for scanning data from a connector or from other workers. It provides the data source processed by each driver, and it is responsible for handling I/O with external systems. If you have any experience creating a Presto connector, you might have seen splits, because a connector passes the actual data in units of splits. Several splits can be included in a driver. If a pipeline starts with a tablescan operator, a driver is created per split; if the pipeline does intermediate data processing, the number of drivers depends on the configuration.
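The "one driver per split" case for a table scan can be sketched as follows. The byte-range splitting scheme and sizes are hypothetical; real connectors decide split boundaries in their own ways.

```python
# Sketch: a connector divides a data source into splits, and for a pipeline
# starting with a table scan, one driver is created per split. The byte-range
# scheme and file size here are hypothetical.

def make_splits(file_size, split_size):
    """Divide a file into (start, length) splits of at most split_size bytes."""
    return [(start, min(split_size, file_size - start))
            for start in range(0, file_size, split_size)]

splits = make_splits(file_size=250, split_size=100)
drivers = [{"split": s, "operators": ["tablescan", "filter"]} for s in splits]
print(len(drivers))  # → 3, one driver per split
```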
Overall, the driver is responsible for the actual data processing, and the driver count can be a good indicator of cluster workload. If more drivers are running, we can consider that the cluster workload is growing. Of course, the actual load of each driver also depends on the data size and operator types, but it is difficult to measure CPU usage and processed data size per query. So in practice, measuring the number of drivers is a good way to know the cluster workload, I think. If you know a better way to measure Presto, or have general knowledge about metrics of distributed systems, I would be glad if you let me know.
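As a closing sketch, here is how running-driver samples could be summarized into a coarse load signal. In Presto the coordinator exposes cluster-wide stats over HTTP (the /v1/cluster response includes a running-driver count), but the sample values below are made up, as if polled once a minute.

```python
# Sketch: treat the number of running drivers as a coarse cluster-load signal.
# The samples are hypothetical values, as if polled from the coordinator.

def summarize(samples):
    """Return (min, max, mean) over a window of running-driver samples."""
    return min(samples), max(samples), sum(samples) / len(samples)

running_drivers = [120, 180, 260, 240, 200]
lo, hi, mean = summarize(running_drivers)
print(lo, hi, mean)  # → 120 260 200.0
```

A spike in the max relative to the mean would suggest bursty workload rather than sustained growth, which is exactly the kind of distinction a single point-in-time reading misses.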