Generalized Linear Aggregate Distributed Engine (GLADE)

GLADE is a scalable distributed framework for large scale data analytics. GLADE consists of a simple user-interface to define Generalized Linear Aggregates (GLA), the fundamental abstraction at the core of GLADE, and a distributed runtime environment that executes GLAs by using parallelism extensively. GLAs are derived from User-Defined Aggregates (UDA), a relational database extension that allows the user to add specialized aggregates to be executed inside the query processor. GLAs extend the UDA interface with methods to Serialize/Deserialize the state of the aggregate required for distributed computation. As a significant departure from UDAs which can be invoked only through SQL, GLAs give the user direct access to the state of the aggregate, thus allowing for the computation of significantly more complex aggregate functions. GLADE runtime is an execution engine optimized for the GLA computation. The runtime takes the user-defined GLA code, compiles it inside the engine, and executes it right near the data by taking advantage of parallelism both inside a single machine as well as across a cluster of computers. This results in maximum possible execution time performance and linear scaleup.

GLADE is a system for the execution of analytical tasks that are associative-decomposable using the iterator-based interface. It exposes the UDA interface consisting of four user-defined functions. The input consists of tuples extracted from a column-oriented relational store, while the output is the state of the computation. The execution model is a multi-level tree in which partial aggregation is executed at each level. The system is responsible for building and maintaining the tree and for moving data between nodes. Except these, the system executes only user code. The blend of column-oriented relations with a tree-based execution architecture allows GLADE to obtain remarkable performance for a variety of analytical tasks---billions of tuples are processed in seconds using only a dozen of commodity nodes.

DataPath. GLADE is a relational execution engine derived from DataPath, a highly efficient multi-query processing database system. The DataPath execution engine has at its core two components: waypoints and work units. A waypoint manages a particular type of computation, e.g., selection, join, etc. The code executed by each waypoint is configured at runtime based on the running queries. In DataPath, all the queries are compiled and loaded in the execution engine at runtime rather than being interpreted. This provides a significant performance improvement at the cost of some negligible compilation time. A waypoint is not executing any query processing job by itself. It only delegates a particular task to a work unit. A work unit is a thread from a thread pool (there are a fixed number of work units in the system) that can be configured to execute tasks. At a high level, the way the entire query execution process happens in DataPath is as follows:
  1. When a new query arrives in the system, the code to be executed is first generated, then compiled and loaded into the execution engine. Essentially, the waypoints are configured with the code to execute for the new query.
  2. Once the storage manager starts to produce chunks for the new query, they are routed to waypoints based on the query execution plan.
  3. If there are available work units in the system, a chunk and the task selected by its current waypoint are sent to a work unit for execution.
  4. When the work unit finishes a task, it is returned to the pool and the chunk is routed to the next waypoint.

GLADE. GLADE consists of two types of nodes: a coordinator and workers. The coordinator is in charge of scheduling the GLA computation and managing the execution across the workers. Each worker runs an instance of DataPath GLA enhanced with communication capabilities. When a job is received by the coordinator, the following steps are executed:
  1. The coordinator generates the code to be executed at each waypoint in the DataPath execution plan. A single execution plan is used for all the workers.
  2. The coordinator creates an aggregation tree connecting all the workers. The tree is used for in-network aggregation of the GLAs.
  3. The execution plan, the code, and the aggregation tree information are broadcasted to all the workers.
  4. Once the worker configures itself with the execution plan and loads the code, it starts to compute the GLA for its local data.
  5. When a worker completes the computation of the local GLA, it first communicates this to the coordinator---the coordinator uses this information to monitor how the execution evolves. If the worker is a leaf, it sends the serialized GLA to its parent in the aggregation tree immediately.
  6. A non-leaf node has one more step to execute. It needs to aggregate the local GLA with the GLAs of its children. For this, it first deserializes the external GLAs and then executes another round of Merge functions. In the end, it sends the combined GLA to the parent.
  7. The worker at the root of the aggregation tree calls the function Terminate before sending the final GLA to the coordinator who passes it further to the client who sent the job.
For a more detailed description of GLADE and experimental results, check the following documents:
  1. GLADE: Big Data Analytics Made Easy by Y. Cheng, C. Qin, and F. Rusu
  2. GLADE: A Scalable Framework for Efficient Analytics by F. Rusu and A. Dobra
  3. GLADE: A Highly-Scalable Architecture-Independent Framework for Efficient Analytics: Poster by F. Rusu
  4. GLADE: A Scalable Framework for Efficient Analytics by F. Rusu

UC Merced | EECS | Home

Last updated: Friday, March 9, 2012