Scientific investigation represents one of the most important sources of Big Data. Science generate massive amounts of data through high-rate measurements of physical conditions, environmental and astronomical observations, and high-precision simulations of physical phenomena. These are made possible by the technological advancements in data acquisition equipment and the magnified density acquired by the modern storage devices which practically resulted in infinite storage capacity. Since scientific data are intrinsically ordered - positional and/or temporal - indexed arrays have gained a lot of attention as a more appropriate data structure to represent scientific data - the ubiquitous unordered relational model made popular by business database systems cannot handle massive ordered data optimally. A relation can be viewed as an array without dimensions, only with attributes. Thus, there is no ordering function that allows the identification of a tuple based on dimensional indexes. Going from relations to arrays, it is required that dimension attributes form a key in the corresponding relation, i.e., there is a functional dependence from the dimension attributes to all the other attributes in the relation. Since a key is maximal, any attribute can be immediately transformed into a dimension. Transforming dimensions into attributes is not that straightforward. To be precise, converting a dimension into an attribute is equivalent to destroying the array property and losing any ordering information. As such, any array can be viewed as a particular type of relation organized along dimensions. What distinguishes an array from a relation is that the array is organized such that finding an entry can be done directly from the value of the indexes (the position), without looking at any other entries. This is not possible in a relation since there is no correspondence between the indexes and the actual position in the physical representation. Consequently, we find the main difference between arrays and relations at the physical organization level since arrays are a particular type of relation from an abstract perspective.
Our initial work on array databases is the survey on array storage, query languages, and systems from 2013. The survey summarizes the most important ideas in array storage and processing by identifying the main research problems and organizes this material to provide an accurate perspective. Specifically, we provide a theoretical formalization of array data and a comparison with the relational data model. Array chunking techniques, chunk storage across a single and multiple disks, and chunk organization are presented in great detail. Array algebras and array query languages are presented together with their theoretical formalization. We analyze the proposed systems for large scale array processing up-to-date. We mostly focus on the execution strategies for primary array operations. We allocate ample space discussing the capabilities of the state-of-the-art array processing systems and the most recent research problems in the context of scientific data processing and how they translate to ordered datasets.
EXTASCID (2013) is the modern array database we have developed starting from some of the limitations identified in the survey. It supports all the operations in the SS-DB benchmark proposed to measure the performance of array databases at handling massive scientific datasets. EXTASCID outperforms SciDB on almost all the operations in the SS-DB benchmark, sometimes by orders of magnitude.
Our most recent work has focused on designing advance database techniques for array databases. The first such technique is a shape-based similarity join operator (2016) for multi-dimensional arrays. Then, we have introduced array views (2017) defined over operations that include the shape-based similarity join operator and user-defined array functions in ArrayUDF (2017).
UC Merced |
Last updated: Friday, December 15, 2017