The multi-dimensionality of the underlying problem can be represented in both relational and multi-dimensional databases, the latter being a better fit when query performance is the criterion for judgment. Relational databases scale well in size for OLAP and multi-dimensional analysis, and efforts are underway to make their performance acceptable. Multi-dimensional databases, on the other hand, have proven to provide good performance for such queries, although they are not very scalable. In this project we address scalability in multi-dimensional systems for OLAP and multi-dimensional analysis.
Multidimensional databases store data in a multidimensional structure on which analytical operations are performed. A challenge for these systems is handling large, sparse data sets with high dimensionality. Parallel computing is necessary to address the performance issues for these data sets.
Chunking is used to store data either as a dense block using
multi-dimensional arrays or as a sparse set using a bit-encoded sparse
structure (BESS). Chunks provide a multi-dimensional index structure
for efficient dimension-oriented data accesses, much the same as
multi-dimensional arrays do. Operations within chunks and between
chunks are a combination of relational and multi-dimensional
operations, depending on whether the chunk is sparse or dense.
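The dense and sparse chunk representations above can be sketched in a few lines; the dimension extents, bit widths, and helper names below are illustrative assumptions, not the project's actual implementation.

```python
import math

dims = (4, 8, 16)  # chunk extents along each dimension (assumed values)

def dense_offset(idx, dims):
    """Row-major linear offset into a dense multi-dimensional array chunk."""
    off = 0
    for i, d in zip(idx, dims):
        off = off * d + i
    return off

# BESS idea: pack each dimension's index into ceil(log2(extent)) bits of
# one integer, so a sparse chunk stores only (encoded_index, value) pairs.
bits = [max(1, math.ceil(math.log2(d))) for d in dims]

def bess_encode(idx):
    code = 0
    for i, b in zip(idx, bits):
        code = (code << b) | i
    return code

def bess_decode(code):
    idx = []
    for b in reversed(bits):
        idx.append(code & ((1 << b) - 1))
        code >>= b
    return tuple(reversed(idx))
```

A dense chunk addresses cell (3, 5, 9) by its array offset, while a sparse chunk would store the same cell as the single packed integer produced by `bess_encode((3, 5, 9))`, recoverable with `bess_decode`.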
In this research we address issues of data partitioning, parallelism, schedule construction, data cube building, chunk storage, and memory usage on a distributed-memory machine architecture. The following figures summarize the available options, especially in terms of storage of cubes, parallelism at different levels, aggregation calculation orderings, and chunk access for aggregations.
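As one illustration of the data-partitioning options, a minimal sketch of distributing cube chunks across processors is given below; the chunk keys and the round-robin assignment policy are assumptions for illustration, not the scheme actually used here.

```python
from itertools import product

def chunk_grid(dims, chunk):
    """Enumerate chunk coordinates for a cube of extents `dims`
    split into chunks of extents `chunk` along each dimension."""
    counts = [(d + c - 1) // c for d, c in zip(dims, chunk)]
    return list(product(*[range(n) for n in counts]))

def assign_chunks(chunks, nprocs):
    """Round-robin assignment of chunks to processors; each processor
    would aggregate its local chunks before a global combine step."""
    return {ch: i % nprocs for i, ch in enumerate(chunks)}

# A 100 x 50 cube in 32 x 32 chunks gives a 4 x 2 grid of 8 chunks.
grid = chunk_grid((100, 50), (32, 32))
owners = assign_chunks(grid, nprocs=4)
```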
Discovery of quantitative rules is associated with quantitative
information from the database. The data cube represents quantitative
summary information for a subset of the attributes. Transaction data
can be used to mine association rules by associating a support and a
confidence with each rule. The support of a pattern A in a set S
is the ratio of the number of transactions containing A to the
total number of transactions in S. The confidence of a rule A
-> B is the probability that pattern B occurs in S when
pattern A occurs in S, and can be defined as the ratio of the
support of AB to the support of A. The rule is then described as A
-> B [support, confidence]. An association rule has a support
greater than a pre-determined minimum support and a confidence
greater than a pre-determined minimum confidence; this can also be
taken as the measure of ``interestingness'' of the rule. Calculation
of support and confidence for the rule A -> B involves the aggregates
from the cubes AB, A, and ALL. Additionally, dimension hierarchies
can be utilized to provide multiple-level data mining by progressive
generalization (roll-up) and deepening (drill-down). This is useful
for data mining at multiple concept levels, where interesting
information can potentially be obtained at different levels.
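The support and confidence definitions above can be sketched directly over a set of transactions; the item names and the rule bread -> milk below are made-up examples, and computing the measures from raw transactions rather than from cube aggregates is an illustrative simplification.

```python
# Made-up transaction data for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread"},
]

def support(itemset, S):
    """Fraction of transactions in S containing every item of `itemset`."""
    return sum(1 for t in S if itemset <= t) / len(S)

A, B = {"bread"}, {"milk"}
sup_AB = support(A | B, transactions)     # support of AB
conf = sup_AB / support(A, transactions)  # confidence = support(AB)/support(A)

# Rule reported as A -> B [support, confidence]
print(f"bread -> milk [{sup_AB:.2f}, {conf:.2f}]")  # prints: bread -> milk [0.50, 0.67]
```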
In this research we integrate data mining tasks with OLAP functionalities and
perform the following data mining tasks in parallel in our framework.
Association rule mining has applications in cross-marketing, attached mailing,
add-on sales, store layout and customer segmentation based on buying
patterns to name a few.
Some applications of classification include fraud detection through
similar-pattern matching, marketing to potential customers who match
the pattern of previous successful sales, identifying healthcare
treatments that are deemed cost-effective because they fit the pattern
of success in previous patients, and investment decisions based on
patterns of previous gains, among many others.
Some applications of clustering include creation of thematic maps in
geographic information systems by clustering feature spaces, detection
of clusters in satellite image data, clustering of Web-log databases to
discover groups of similar access patterns which may correspond to
different user profiles or buying behavior in E-commerce systems, and
identification of clusters of data on disks and tapes for efficient
storage and retrieval of large-scale scientific experimental data.
We present MAFIA (Merging Adaptive Finite IntervAls), which (1)
proposes adaptive grids for fast subspace clustering and (2) introduces
a scalable parallel framework on a shared-nothing architecture to
handle massive data sets. Experiments on very large data sets with a
large number of dimensions show an order of magnitude improvement in
computation time over current methods, along with much better quality
of clustering.

Association Rules: Scalable and parallel association rule mining with
attribute-oriented and attribute-focusing methods for large data sets.

Classification: Scalable and parallel decision-tree-based
classification algorithms.

Clustering: Scalable and parallel density- and grid-based clustering
algorithms for large data sets with large numbers of dimensions, also
incorporating subspace clustering.
Project Members: Sanjay Goil, Harsha Nagesh and Alok Choudhary
MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets
Clustering techniques are used in database mining for finding
interesting patterns in high dimensional data. These are useful in
various applications of knowledge discovery in databases. Some
challenges in clustering for large data sets in terms of scalability,
data distribution, understanding end-results, and sensitivity to input
order, have received attention in the recent past. Recent approaches
attempt to find clusters embedded in subspaces of high dimensional
data. In this research we propose the use of adaptive grids for
efficient and scalable computation of clusters in subspaces for large
data sets with a large number of dimensions. The bottom-up algorithm
for subspace clustering computes the dense units in all dimensions and
combines these to generate the dense units in higher dimensions.
Computation is heavily dependent on the partitioning parameter used to
divide each dimension into intervals (bins) to be tested for density.
The number of bins determines both the computation requirements and
the quality of the clustering results. Hence, it is important to
determine the appropriate size and number of the bins.
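A one-dimensional sketch of such adaptive bin construction is given below: build a fine uniform histogram, then merge adjacent bins of similar density into larger adaptive intervals. The fine bin count and the merge tolerance are assumptions for illustration, not MAFIA's actual parameters.

```python
def adaptive_intervals(values, fine_bins=20, merge_tol=0.5):
    """Merge adjacent fixed-size histogram bins of similar density
    into adaptive intervals. Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / fine_bins
    counts = [0] * fine_bins
    for v in values:
        counts[min(int((v - lo) / width), fine_bins - 1)] += 1

    intervals = []  # list of (lo_edge, hi_edge, count)
    for k, c in enumerate(counts):
        b_lo = lo + k * width
        b_hi = b_lo + width
        if intervals:
            plo, phi, pc = intervals[-1]
            prev_d = pc / (phi - plo)  # density of previous interval
            d = c / width              # density of this fine bin
            # Merge when relative density difference is within tolerance.
            if abs(d - prev_d) <= merge_tol * max(d, prev_d, 1e-9):
                intervals[-1] = (plo, b_hi, pc + c)
                continue
        intervals.append((b_lo, b_hi, c))
    return intervals

# Uniformly spread data collapses into a single adaptive interval.
iv = adaptive_intervals([i / 100 for i in range(100)], fine_bins=10)
```

Fewer, wider intervals in uniform regions is exactly what reduces the number of candidate dense units the bottom-up step must test, which is where the computation savings come from.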