Simply storing information in a data warehouse does not provide the benefits an organization is seeking. To realize the value of a data warehouse, it is necessary to extract the knowledge hidden within the warehouse. However, as the amount and complexity of the data in a data warehouse grows, it becomes increasingly difficult, if not impossible, for business analysts to identify trends and relationships in the data using simple query and reporting tools.
Data mining is one of the best ways to extract meaningful trends and patterns from huge amounts of data. Data mining discovers information within a data warehouse that queries and reports cannot effectively reveal.
Introduction to Data Mining
Data mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
Data mining is concerned with the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data. The focus of data mining is to find the information that is hidden and unexpected.
Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing. Although data mining is still a relatively new technology, it is already used in a number of industries. The table lists examples of applications of data mining in retail/marketing, banking, insurance, and medicine.
Examples of data mining applications
Data Mining Techniques
There are four main operations associated with data mining techniques:
• Predictive modeling
• Database segmentation
• Link analysis
• Deviation detection.
Techniques are specific implementations of the data mining operations. However, each operation has its own strengths and weaknesses. With this in mind, data mining tools sometimes offer a choice of techniques to implement an operation.
Predictive Modeling
Predictive modeling resembles the human learning experience in using observations to form a model of the important characteristics of some task. The model is meant to correspond to the ‘real world’. It is developed using a supervised learning approach, which has two phases: training and testing. The training phase builds the model from a large sample of historical data called a training set, while the testing phase tries out the model on new, previously unseen data to determine its accuracy and physical performance characteristics.
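The two phases can be illustrated with a minimal sketch. It assumes a deliberately simple model (a single cut-off on one numeric field) and hypothetical historical records; the function names and the data are illustrative only and do not come from any particular tool.

```python
import random

def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a fraction as unseen test data."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold(training_set):
    """Training phase: pick the cut-off on x that best separates the labels."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in training_set}):
        correct = sum((x >= threshold) == label for x, label in training_set)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def accuracy(threshold, test_set):
    """Testing phase: fraction of unseen records the model labels correctly."""
    hits = sum((x >= threshold) == label for x, label in test_set)
    return hits / len(test_set)

# Hypothetical historical data: (years as a customer, stayed with the company?)
history = [(x, x >= 3) for x in range(0, 8)] * 5
train, test = split_train_test(history)
model = train_threshold(train)
print("learned cut-off:", model, "accuracy on unseen data:", accuracy(model, test))
```

The point of the split is that accuracy is measured only on records the training phase never saw.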
It is commonly used in customer retention management, credit approval, cross-selling, and direct marketing. There are two techniques associated with predictive modeling. These are:
• Classification
• Value prediction
Classification
Classification assigns each record to one class from a finite set of possible class values. There are two specializations of classification: tree induction and neural induction. An example of classification using tree induction is shown in the figure.
In this example, we are interested in predicting whether a customer who is currently renting property is likely to be interested in buying property. A predictive model has determined that only two variables are of interest: the length of time the customer has rented property and the age of the customer. The model predicts that those customers who have rented for more than two years and are over 25 years old are the most likely to be interested in buying property. An example of classification using neural induction is shown in the figure.
A neural network contains a collection of connected nodes with input, output, and processing at each node. Between the visible input and output layers there may be a number of hidden processing layers. Each processing unit (circle) in one layer is connected to each processing unit in the next layer by a weighted value, expressing the strength of the relationship. This approach is an attempt to copy the way the human brain works in recognizing patterns by arithmetically combining all the variables associated with a given data point.
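Returning to the tree-induction example, the resulting model can be written as two nested tests. The sketch below only applies the tree described in the text; it does not perform the induction that would derive the tree from data, and the customer records are hypothetical.

```python
def likely_to_buy(years_renting, age):
    """Follow the two-level tree from the example: rental period first, then age."""
    if years_renting > 2:
        if age > 25:
            return True   # rented for more than two years and over 25
        return False      # rented long enough, but 25 or younger
    return False          # rented for two years or less

# Hypothetical records: (customer, years renting, age)
customers = [
    ("A", 3.5, 31),
    ("B", 1.0, 40),
    ("C", 4.0, 22),
]
prospects = [name for name, years, age in customers if likely_to_buy(years, age)]
print(prospects)  # ['A']
```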
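The weighted, layer-by-layer combination described above can be sketched as a forward pass. The sigmoid activation, the bias terms, the layer sizes, and the weight values are all assumptions made for illustration; the text does not specify them, and the weights here are untrained.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Each unit combines every value from the previous layer by weighted sum."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

def network_forward(inputs, layers):
    """Pass values from the input layer through each later layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Illustrative, untrained weights: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
score = network_forward([0.6, 0.3], layers)[0]
print(round(score, 3))
```

Training would adjust the weights so that the output matches the known class of each record in the training set; that step is omitted here.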
Value prediction
Value prediction uses the traditional statistical techniques of linear regression and nonlinear regression. These techniques are easy to use and understand. Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot. The problem with linear regression is that the technique only works well with linear data and is sensitive to outliers, that is, data values that do not conform to the expected norm. Although nonlinear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot. This is where the traditional statistical analysis methods and data mining methods begin to diverge. Applications of value prediction include credit card fraud detection and target mailing list identification.
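A short sketch makes the sensitivity of linear regression concrete. It assumes ordinary least squares as the fitting method and uses made-up data; a single value that departs from the norm changes the fitted line substantially.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]            # perfectly linear data: y = 2x
ys_outlier = [2, 4, 6, 8, 40]    # same data with one value off the norm

print(fit_line(xs, ys))          # (2.0, 0.0)
print(fit_line(xs, ys_outlier))  # (8.0, -12.0)
```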
Database Segmentation
A segment is a group of similar records that share a number of properties. The aim of database segmentation is to partition a database into an unknown number of segments, or clusters.
This approach uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles. Applications of database segmentation include customer profiling, direct marketing, and cross-selling.
As shown in the figure, using database segmentation, we identify the clusters that correspond to legal tender and forgeries. Note that there are two clusters of forgeries, which is attributed to at least two gangs of forgers working on falsifying the banknotes.
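The banknote example can be sketched with a simple clustering routine that does not fix the number of segments in advance. The method shown, leader clustering with a distance radius, is chosen here for brevity and is an assumption, not the technique behind the figure; the measurements and the radius are hypothetical.

```python
import math

def segment(records, radius):
    """Leader clustering: a record joins the first segment whose leader lies
    within `radius`; otherwise it starts a new segment."""
    segments = []  # each segment is a list; its first record is the leader
    for record in records:
        for members in segments:
            if math.dist(record, members[0]) <= radius:
                members.append(record)
                break
        else:
            segments.append([record])
    return segments

# Hypothetical banknote measurements: (width in mm, height in mm)
notes = [
    (130.0, 65.0), (130.2, 65.1), (129.9, 64.9),
    (131.5, 66.4), (131.6, 66.5),
    (128.4, 63.8), (128.5, 63.7),
]
clusters = segment(notes, radius=0.5)
print(len(clusters))                # 3
print([len(c) for c in clusters])   # [3, 2, 2]
```

The routine only groups similar records. Deciding which segment is legal tender and which are forgeries still requires interpretation by an analyst.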
Link Analysis
Link analysis aims to establish links, called associations, between individual records, or sets of records, in a database. There are three specializations of link analysis. These are:
• Associations discovery
• Sequential pattern discovery
• Similar time sequence discovery.
Associations discovery finds items that imply the presence of other items in the same event. These affinities are expressed as association rules. For example, ‘when a customer rents property for more than two years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties’.
Sequential pattern discovery finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. For example, this approach can be used to understand long-term customer buying behavior.
Similar time sequence discovery is used in the discovery of links between two sets of data that are time-dependent. For example, within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.
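The two percentages in such a rule are commonly called confidence (how often the rule holds when its condition is met) and support (how often the whole association occurs among all records). The sketch below computes both under those definitions, which are an assumption about how the figures in the example were obtained; the renter records are hypothetical.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    matches_if = [r for r in records if antecedent(r)]
    matches_both = [r for r in matches_if if consequent(r)]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence

# Hypothetical renter records: (years renting, age, bought a property?)
renters = [
    (3, 30, True), (4, 45, True), (5, 28, False), (3, 33, False), (6, 50, False),
    (1, 30, False), (2, 22, False), (4, 24, True), (1, 40, False), (3, 26, False),
]
support, confidence = rule_stats(
    renters,
    antecedent=lambda r: r[0] > 2 and r[1] > 25,  # rented > 2 years, older than 25
    consequent=lambda r: r[2],                    # bought a property
)
print(support, round(confidence, 2))  # 0.2 0.33
```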
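A time-windowed link such as the home owner example can be checked with a short sketch. It measures, for customers who have a given first event, how many show a follow-up event within a set number of days. The purchase histories, the item names, and the 90-day window (standing in for three months) are hypothetical.

```python
def followed_within(histories, first, then, max_days):
    """Fraction of customers with `first` whose history shows `then`
    no more than `max_days` after the earliest `first` event."""
    with_first = followed = 0
    for events in histories.values():
        first_days = [day for day, item in events if item == first]
        if not first_days:
            continue  # this customer never had the first event
        with_first += 1
        start = min(first_days)
        if any(item == then and 0 < day - start <= max_days for day, item in events):
            followed += 1
    return followed / with_first if with_first else 0.0

# Hypothetical purchase histories: customer -> [(day number, item)]
histories = {
    "c1": [(0, "property"), (30, "cooker"), (45, "freezer")],
    "c2": [(10, "property"), (200, "cooker")],
    "c3": [(5, "cooker")],
    "c4": [(0, "property"), (80, "cooker")],
}
rate = followed_within(histories, "property", "cooker", max_days=90)
print(round(rate, 2))  # 0.67
```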
Applications of link analysis include product affinity analysis, direct marketing, and stock price movement.
Deviation Detection
Deviation detection is a relatively new technique in terms of commercially available data mining tools. However, deviation detection is often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm. This operation can be performed using statistics and visualization techniques.
Applications of deviation detection include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
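One statistical way to flag outliers is sketched below. It assumes a median-based rule (the modified z-score with the conventional 0.6745 constant and a cut-off of 3.5), which is one common choice among many and is not prescribed by the text; the claim amounts are hypothetical.

```python
import statistics

def find_deviations(values, cutoff=3.5):
    """Flag values whose modified z-score, based on the median absolute
    deviation (MAD), exceeds the cut-off."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure deviation against
    return [v for v in values if 0.6745 * abs(v - center) / mad > cutoff]

# Hypothetical insurance claim amounts (in some currency unit)
claims = [52, 48, 50, 51, 49, 50, 47, 53, 250]
print(find_deviations(claims))  # [250]
```

A median-based rule is used because the mean and standard deviation are themselves pulled toward the outlier they are meant to detect.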
Data Mining and Data Warehousing
Data mining requires a single, separate, clean, integrated, and self-consistent source of data. A data warehouse is well equipped for providing data for mining for the following reasons:
• Data mining requires quality and consistency in its input data, and a data warehouse provides both.
• It is advantageous to mine data from multiple sources to discover as many interrelationships as possible. A data warehouse contains data from a number of sources.
• The query capabilities of the data warehouse help in selecting the relevant information.
Because data mining and data warehousing are so closely integrated, many vendors are investigating a number of techniques to support this integration.