Brief Summary of Thesis Research

Decision support systems increasingly leverage the information present in data warehouses. Data analysis and data mining on these repositories pose new challenges, as traditional database systems handle ad-hoc decision-support queries poorly. The underlying dimensionality of such systems is modeled well by multidimensional databases, which have gained widespread acceptance for OnLine Analytical Processing (OLAP) and data analysis tasks. However, they have been plagued by problems of size and scalability that prevent their use on large data warehouses. Further, automated techniques for discovering knowledge in data are encapsulated in data mining tasks such as association rules and decision-tree-based classification.

My thesis research has focused on making multidimensional databases scalable and on integrating data mining tasks with them. I have designed and implemented a parallel, scalable infrastructure for multidimensional On-Line Analytical Processing, called PARSIMONY. To our knowledge this is the first effort of its kind in this domain.

1. Multidimensional data analysis

This research introduced a new sparse storage structure, Bit-Encoded Sparse Structure (BESS), which stores compressed data in chunks and supports aggregate operations directly on the compressed data. This allows BESS to handle a large number of dimensions, whereas previous methods were unsuitable because they decompressed the data into multidimensional arrays before performing the operations. An analysis of BESS against other sparse schemes, such as the hierarchical format, Offset-Value, and compressed-sparse format, showed that BESS is faster for the dimensional operations used in OLAP analysis. We developed a parallel, sort-based algorithm to load data in chunks from a tuple format; it is locality-conscious in its chunk accesses, in contrast to the hash-based method used in previous work. Optimizations for schedule generation for simultaneous aggregate calculation, as needed by the data cube operator in OLAP, had previously been proposed only in the sequential domain. We analyzed communication costs and incorporated them into the parallel scheduler, in addition to the established optimizations of smallest parent, reduced I/O scans, and cached results. To increase performance, the partitioning of sub-cubes is chosen adaptively, depending on the two largest dimensions present. Uniprocessor sub-cubes and sub-cubes partitioned on one or two dimensions are supported, as are both chunked and array cube storage, depending on density and cube size. We show that dimension-oriented chunk traversal outperforms the chunk-storage-order traversal used in previous work. The scalability of the system stems from the novel design of the directory structure that holds the meta-data for data cubes. Paged data structures are used for the data cube structure and the chunk structure of each cube, making efficient use of memory instead of relying on the system's virtual memory. Minichunks are introduced to store large chunks in parts, and these are swappable.
Extensive computation, communication, and I/O performance analysis was carried out on a variety of data sets to study the effect of the various design parameters; the system achieves good scaleup and sizeup performance.
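The core BESS idea described above can be illustrated with a minimal sketch. The encoding below is my own simplified reconstruction, not the thesis implementation: within a chunk, each cell's per-dimension offsets are packed into a single integer using the minimum number of bits per dimension, so aggregating away a dimension reduces to masking out its bits and combining colliding keys, with no decompression into a multidimensional array.

```python
from collections import defaultdict

def make_codec(dim_bits):
    """dim_bits[i] = number of bits for dimension i's offset within a chunk."""
    shifts, s = [], 0
    for b in dim_bits:
        shifts.append(s)
        s += b
    masks = [((1 << b) - 1) << sh for b, sh in zip(dim_bits, shifts)]
    return shifts, masks

def encode(coords, shifts):
    """Pack per-dimension offsets into one integer key."""
    key = 0
    for c, sh in zip(coords, shifts):
        key |= c << sh
    return key

def aggregate_out(cells, dim, masks):
    """Sum away one dimension: clear its bits in every key and
    combine the cells that now share a key."""
    out = defaultdict(float)
    for key, val in cells.items():
        out[key & ~masks[dim]] += val
    return dict(out)
```

For example, with two 2-bit dimensions, summing out the second dimension merges cells (0,0) and (0,1) into a single aggregate keyed only by the first dimension's offset.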

2. Data Mining

We developed an integrated framework for OLAP and data mining, with new parallel algorithms for attribute-oriented associations and decision-tree-based classification using the multidimensional model. Attribute focusing and attribute-oriented association-rule calculations can be performed on the summary values present in the data cube: the aggregate values in the higher-level aggregations can be used to compute the probabilities of the various attribute values. We introduced the construction of a partial data cube, using left and right schedules, to reduce the number of intermediate aggregates needed for 2-way and 3-way associations. We also identified how to calculate the support and confidence measures on the data cube structure, adapting another popular data mining algorithm to this setting.
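The point that support and confidence fall out of cube aggregates can be sketched briefly. This toy example is illustrative only (the cube, dimension names, and thresholds are invented): the cell counts and marginals needed for a 2-way association A=1 => B=1 are read from precomputed aggregates rather than by rescanning base records.

```python
# Toy cube of counts over two dimensions A x B (two values each).
cube = {(0, 0): 30, (0, 1): 10, (1, 0): 5, (1, 1): 55}

# Aggregates that a data cube would already hold precomputed:
total = sum(cube.values())                                  # grand total
count_a1 = sum(v for (a, _), v in cube.items() if a == 1)   # marginal on A=1
count_a1b1 = cube[(1, 1)]                                   # joint cell

# Support and confidence of the rule A=1 => B=1, from aggregates alone.
support = count_a1b1 / total
confidence = count_a1b1 / count_a1
```

The same pattern extends to 3-way associations, where the partial cube keeps just the group-bys needed for the joint and marginal counts.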

We proposed a new algorithm for decision-tree-based classification on the OLAP infrastructure that requires less communication at each level of the classification tree than the previous attribute-list-based scalable versions in the literature. Dimensional operations are used to maintain class-id counts for each dimension at each node of the tree. Communication exchanges these summary counts, and concatenated parallelism is used to perform only one communication round per level. The message size is of the order of the number of active nodes at a level times the dimension sizes, which is usually much smaller than the total number of records.
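The count-exchange idea above can be sketched as follows. This is a simplified illustration under my own assumptions (the data layout and function names are invented, and the exchange is simulated rather than performed with real message passing): each processor tallies class-id counts per (dimension, attribute value) for its local records, and only these small tables, whose size is independent of the number of records, are combined across processors.

```python
from collections import Counter, defaultdict

def local_counts(records, labels, n_dims):
    """Per-processor tally: counts[d][value][class_id] over local records."""
    counts = [defaultdict(Counter) for _ in range(n_dims)]
    for rec, y in zip(records, labels):
        for d in range(n_dims):
            counts[d][rec[d]][y] += 1
    return counts

def combine(all_proc_counts, n_dims):
    """Simulated reduction step: sum the summary tables from all
    processors into global class-id counts per dimension value."""
    total = [defaultdict(Counter) for _ in range(n_dims)]
    for counts in all_proc_counts:
        for d in range(n_dims):
            for val, cls in counts[d].items():
                total[d][val].update(cls)
    return total
```

After the combine step, every processor holds the global counts needed to evaluate split criteria for all active tree nodes at that level, so one exchange per level suffices.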

Post-PhD Research at Northwestern

MAFIA - Parallel Large Scale Subspace Clustering

Clustering techniques are used in database mining to find interesting patterns in high-dimensional data, and they are useful in many applications of knowledge discovery in databases. Challenges in clustering large data sets, in terms of scalability, data distribution, interpretability of end results, and sensitivity to input order, have received attention in the recent past. Recent approaches attempt to find clusters embedded in subspaces of high-dimensional data. In this research we propose the use of adaptive grids for efficient, scalable computation of subspace clusters for large data sets with a large number of dimensions. The bottom-up algorithm for subspace clustering computes the dense units in each single dimension and combines these to generate the dense units in higher dimensions. The computation depends heavily on the parameter chosen to partition each dimension into intervals (bins) to be tested for density: the number of bins determines both the computational requirements and the quality of the clustering results. Hence, it is important to determine the appropriate size and number of the bins.
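The adaptive-grid idea for a single dimension can be sketched as below. This is an illustrative simplification, not the published MAFIA algorithm or its parameter settings: a fine uniform histogram is computed first, adjacent bins with similar counts are merged into variable-size intervals, and only intervals whose density exceeds a multiple of the uniform expectation are kept as candidate dense units.

```python
def adaptive_bins(counts, merge_tol=0.25):
    """Merge neighboring fine bins whose counts differ by less than
    merge_tol (relative to the larger count) into one interval.
    Returns (start_bin, end_bin_exclusive, total_count) triples."""
    intervals = []
    start, total = 0, counts[0]
    for i in range(1, len(counts)):
        hi = max(counts[i - 1], counts[i], 1)
        if abs(counts[i] - counts[i - 1]) / hi <= merge_tol:
            total += counts[i]
        else:
            intervals.append((start, i, total))
            start, total = i, counts[i]
    intervals.append((start, len(counts), total))
    return intervals

def dense_intervals(intervals, n_points, n_bins, factor):
    """Keep intervals whose per-bin density exceeds factor times the
    density expected under a uniform distribution."""
    expected = n_points / n_bins
    return [iv for iv in intervals if iv[2] / (iv[1] - iv[0]) > factor * expected]
```

Merging flat regions into a few wide bins is what keeps the number of candidate units, and hence the bottom-up combination cost in higher dimensions, small.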

We present MAFIA, which 1) uses adaptive grids for fast subspace clustering and 2) introduces a scalable parallel framework on a shared-nothing architecture to handle massive data sets. Performance results on very large data sets with a large number of dimensions are very good, showing an order-of-magnitude improvement in computation time over current methods together with much better clustering quality.

(work with Harsha Nagesh and Alok Choudhary)