- Parallel scalable algorithms and I/O for databases and scientific applications
- Large scale multidimensional databases and OLAP
- Data analysis of large data sets
- Scalable data mining - Associations, Classification and Clustering
- Parallel Systems and Computer Architecture
- Performance modeling
- Networked Computing
- Parallel algorithms
- Design and analysis of algorithms for parallel treecodes, multidimensional binary search trees
- Investigating combinations of task and data parallelism
Brief Summary of Thesis Research
Decision support systems increasingly leverage the information
present in data warehouses. Data analysis and data mining on these
repositories pose new challenges, as traditional database systems
cannot handle ad-hoc decision support queries efficiently. The
underlying dimensionality of the data is modeled well by
multidimensional databases, and these have gained widespread acceptance
for OnLine Analytical Processing (OLAP) and data analysis tasks.
However, they have been plagued by problems of size and scalability
which prevent them from being useful on large data
warehouses. Further, automated techniques for finding knowledge in
data are encapsulated in data mining tasks such as Association Rules
and decision tree based Classification.
My thesis research has focused on making multidimensional databases
scalable and integrating data mining tasks with them. I have designed
and implemented a parallel scalable infrastructure for multidimensional
On-Line Analytical Processing (called PARSIMONY). To our knowledge
this is the first effort of its kind in this domain.
1. Multidimensional data analysis
This research introduced a new sparse storage structure, the
Bit-Encoded Sparse Structure (BESS), which stores compressed data in
chunks and supports aggregate operations directly on the compressed
representation. This allows it to handle a large number of dimensions,
whereas previous methods were unsuitable because they uncompressed the
data into multidimensional arrays before performing the operations. An
analysis comparing BESS with other sparse schemes, such as the
hierarchical format, Offset-Value, and Compressed-sparse format,
showed that BESS is faster for the dimensional operations used in OLAP
analysis. We developed a parallel sort-based algorithm to load data
into chunks from a tuple format; it is locality-conscious in its chunk
accesses, in contrast to the hash-based method used in previous work.
Optimizations of schedule generation for simultaneous aggregate
calculation, as needed for the data cube operator in OLAP, had
previously been proposed only in the sequential domain. We analyzed
communication costs and incorporated them into the parallel scheduler,
in addition to the other optimizations of smallest parent, reduced I/O
scans, and cached results. To increase performance, the partitioning
of sub-cubes is chosen adaptively, depending on the two largest
dimensions present. Uniprocessor sub-cubes and sub-cubes partitioned
on one or two dimensions are supported, as are both chunked and array
cube storage, depending on density and cube size. We show that
dimension-oriented chunk traversal outperforms the chunk-storage-order
traversal used in previous work.
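The core idea behind a BESS-style key can be sketched as follows: each
dimension index is packed into a fixed bit field of a single integer,
and an aggregation that sums out a dimension simply clears that
dimension's bit field, with no decompression into a dense array. This
is an illustrative sketch only; the field layout, function names, and
dictionary-based cell storage are assumptions, not the thesis
implementation.

```python
def make_codec(dim_sizes):
    """Pack one index per dimension into a single integer key,
    using ceil(log2(size)) bits per dimension."""
    bits = [max(1, (s - 1).bit_length()) for s in dim_sizes]
    shifts = []
    off = 0
    for b in bits:
        shifts.append(off)
        off += b
    masks = [((1 << b) - 1) << s for b, s in zip(bits, shifts)]

    def encode(idx):
        key = 0
        for i, s in zip(idx, shifts):
            key |= i << s
        return key

    def decode(key):
        return tuple((key & m) >> s for m, s in zip(masks, shifts))

    return encode, decode, masks

def aggregate_out(cells, dim, masks):
    """Sum out one dimension directly on the compressed keys by
    clearing its bit field -- no decompression to a dense array."""
    out = {}
    clear = ~masks[dim]
    for key, val in cells.items():
        k = key & clear
        out[k] = out.get(k, 0) + val
    return out
```

For example, with dimension sizes (4, 8, 16) a cell position needs only
2 + 3 + 4 = 9 bits, which is what allows many dimensions to fit in one
compact key.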
The system's scalability stems from the novel design of the directory
structure that holds the meta-data for the data cubes. Paged data
structures are used for both the data cube structure and the chunk
structure of each cube, making efficient use of memory instead of
relying on the system's virtual memory. Minichunks are introduced to
store large chunks in parts, and these are swappable. Extensive
computation, communication, and I/O performance analysis on a variety
of data sets studies the effect of the various design parameters and
demonstrates good scaleup and sizeup performance.
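The paging behavior described above can be sketched as an
application-managed cache: a large chunk is split into fixed-size
minichunks, and a bounded number of them are kept resident, with
least-recently-used ones swapped out. The class names, the LRU policy,
and the in-memory "disk" store are illustrative assumptions, not the
PARSIMONY design.

```python
from collections import OrderedDict

MINICHUNK_CELLS = 4  # illustrative capacity per minichunk

def split_into_minichunks(cells):
    """Split a chunk's cell list into fixed-size minichunks."""
    return [cells[i:i + MINICHUNK_CELLS]
            for i in range(0, len(cells), MINICHUNK_CELLS)]

class MinichunkCache:
    """Keep at most `budget` minichunks in memory; evict the
    least-recently-used one back to the store when over budget."""
    def __init__(self, store, budget):
        self.store = store          # minichunk id -> data (stands in for disk)
        self.budget = budget
        self.resident = OrderedDict()

    def get(self, mc_id):
        if mc_id in self.resident:
            self.resident.move_to_end(mc_id)       # mark as recently used
        else:
            if len(self.resident) >= self.budget:
                self.resident.popitem(last=False)  # swap out LRU minichunk
            self.resident[mc_id] = self.store[mc_id]  # swap in
        return self.resident[mc_id]
```

The point of the sketch is that the memory budget is controlled by the
application, independent of how the operating system pages virtual
memory.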
2. Data Mining
We developed an integrated framework for OLAP and data mining, with
new parallel algorithms for attribute-oriented associations and
decision tree based classification using the multidimensional
model. Attribute focusing and attribute-oriented association rule
calculations can be done on the summary values present in the data
cube: the aggregate values in the higher-level aggregations can be
used to compute the probabilities of the various attribute values. We
introduced the construction of a partial data cube, using left and
right schedules, to reduce the number of intermediate aggregates for
the task of 2-way and 3-way associations. We also identified how
support and confidence measures can be calculated on the data cube
structure, adapting another popular data mining algorithm.
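The relationship between cube aggregates and rule measures can be
sketched concretely: the base cells of a two-dimensional cube give the
joint counts for support, and the one-dimensional (higher-level)
aggregates give the denominators for confidence. The cube layout as a
dictionary of pairs and the function name are illustrative
assumptions.

```python
def rule_measures(cube_ab, a_val, b_val):
    """Support and confidence of the rule (A=a_val) => (B=b_val),
    where cube_ab maps (a, b) value pairs to counts."""
    total = sum(cube_ab.values())                     # the ALL-ALL aggregate
    count_ab = cube_ab.get((a_val, b_val), 0)         # base cell: joint count
    count_a = sum(v for (a, _), v in cube_ab.items()  # A-marginal aggregate
                  if a == a_val)
    support = count_ab / total
    confidence = count_ab / count_a if count_a else 0.0
    return support, confidence
```

Because all three quantities are aggregates the cube already
maintains, the measures come from lookups rather than a scan of the
base records.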
We proposed a new algorithm for decision tree based classification on
the OLAP infrastructure which requires less communication at each
level of the classification tree than the previous attribute-list
based scalable versions in the literature. Dimensional operations are
used to maintain class-id counts for each dimension at each node of
the tree. Communication exchanges these summary counts, and
concatenated parallelism is used so that only one communication round
is performed at each level. The message size is on the order of the
number of active nodes at a level times the dimension sizes, which is
usually much smaller than the total number of records.
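The one-round exchange can be sketched as follows: each processor
builds local class-id counts keyed by (active node, dimension,
attribute value, class), and the counts for all active nodes are
concatenated and summed in a single round. The in-memory reduction
below stands in for a message-passing all-reduce, and the record
layout is an illustrative assumption.

```python
from collections import Counter

def local_counts(records, active_nodes):
    """Count classes per (node, dim, value) from local records.
    Each record is (node_id, attribute_tuple, class_id)."""
    counts = Counter()
    for node, attrs, cls in records:
        if node in active_nodes:
            for dim, val in enumerate(attrs):
                counts[(node, dim, val, cls)] += 1
    return counts

def exchange(per_processor_counts):
    """One communication round per tree level: sum every processor's
    counts. The message size scales with active nodes x dimension
    sizes, not with the number of records."""
    total = Counter()
    for c in per_processor_counts:
        total.update(c)
    return total
```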
Post-PhD Research at Northwestern
MAFIA - Parallel Large Scale Subspace Clustering
Clustering techniques are used in database mining for finding
interesting patterns in high dimensional data. These are useful in
various applications of knowledge discovery in databases. Some
challenges in clustering for large data sets in terms of scalability,
data distribution, understanding end-results, and sensitivity to input
order, have received attention in the recent past. Recent approaches
attempt to find clusters embedded in subspaces of high dimensional
data. In this research we propose the use of adaptive grids for
efficient and scalable computation of clusters in subspaces, for large
data sets with large numbers of dimensions. The bottom-up algorithm
for subspace clustering computes the dense units in each dimension and
combines them to generate the dense units in higher dimensions. The
computation depends heavily on the partitioning parameter that divides
each dimension into intervals (bins) to be tested for density: the
number of bins determines both the computational requirements and the
quality of the clustering results. Hence, it is important to determine
the appropriate size and number of the bins.
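One way to sketch adaptive grid construction in a single dimension is
to start from a fine uniform histogram and merge runs of adjacent fine
bins with similar counts, so that bin boundaries adapt to the data
distribution instead of being fixed in advance. The similarity ratio,
fine-bin count, and function names here are illustrative assumptions,
not the MAFIA algorithm itself.

```python
def fine_histogram(values, lo, hi, n_fine):
    """Uniform fine-grained histogram over [lo, hi)."""
    counts = [0] * n_fine
    width = (hi - lo) / n_fine
    for v in values:
        i = min(int((v - lo) / width), n_fine - 1)
        counts[i] += 1
    return counts

def adaptive_bins(counts, ratio=1.5):
    """Merge runs of adjacent fine bins whose counts are within a
    `ratio` factor of each other into one adaptive bin; returns a
    list of (start, end) fine-bin index ranges."""
    bins = []
    start = 0
    for i in range(1, len(counts)):
        a, b = counts[i - 1], counts[i]
        similar = min(a, b) * ratio >= max(a, b)
        if not similar:          # density changes: start a new bin
            bins.append((start, i))
            start = i
    bins.append((start, len(counts)))
    return bins
```

Regions of near-uniform density collapse into a few wide bins, which
is what keeps the number of candidate units, and hence the
computation, small.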
We present
MAFIA, which 1) proposes adaptive grids for fast subspace clustering
and 2) introduces a scalable parallel framework on a shared-nothing
architecture to handle massive data sets. On very large data sets with
many dimensions, MAFIA achieves an order-of-magnitude improvement in
computation time over current methods while providing much better
clustering quality.
(work with Harsha Nagesh and Alok Choudhary)