COMPARISON OF PREDICTIVE DATA MINING TECHNIQUES

Data mining is the process of discovering knowledge in databases. It is used to extract and identify important information and subsequent knowledge by searching huge amounts of data for significant patterns and relationships. Various techniques have been proposed for data mining; each has its strengths and weaknesses, and each is suited to different types of data. The proposed study aims to compare various predictive data mining techniques to identify the best approach for particular data sets. Specifically, regression (Partial Least Squares (PLS), Principal Component Regression (PCR) and Multiple Linear Regression (MLR)), time series (Seasonal Auto-Regressive Integrated Moving Average (SARIMA), Hidden Markov models and Self-Organizing Maps (SOM)) and classification methods (neural networks, decision trees and Bayesian classification) will be compared. Kappa statistics; accuracy, sensitivity and specificity (ASS); Receiver Operating Characteristics (ROC); and the Mean Absolute Error (MAE) will be used to compare the techniques. To conduct the comparisons, various open source tools including RapidMiner, KNIME, Weka and Orange will be utilized. The outcome of the study will enable data scientists and analysts to identify the best data mining technique for a given data set.

Keywords: Data Mining, prediction techniques, regression, time series, classification

Comparison of predictive data mining techniques

Overall Problem

Data mining is the process of discovering knowledge in databases. It is used to extract and identify important information and subsequent knowledge (Li, Nsofor & Song, 2009). Various techniques are used in data mining, including mathematical, statistical, machine learning and artificial intelligence techniques (Sondwale, 2015). Data mining is an important field because it provides information that can support forecasting and decision-making in many fields, including sociology, psychology, science, the military and business. Its unique predictive power comes from its combination of techniques from statistics, pattern recognition and machine learning to automatically extract information and identify targeted patterns and interrelations in big data (Sondwale, 2015). Organizations often utilize the power and capabilities of data mining to discover hidden patterns in data, and the extracted patterns are used to develop models that can predict behavior and performance with high accuracy (Li, Nsofor & Song, 2009). However, identifying the appropriate technique for a given task is a key challenge because multiple techniques are available.
Quick Decomposition of the Problem

Data mining emerged in the 1990s as one of the analysis steps of the knowledge discovery in databases (KDD) process. Data mining involves multiple data and database management aspects, including pre-processing, discovery of patterns, post-processing of discovered structures, updating and visualization (Sondwale, 2015). Due to this comprehensive approach to analysis, data mining is useful in complex environments. Today, advances in technology have made it possible to monitor and record nearly all aspects of life (Han, Kamber & Pei, 2011). The rapid growth of social networking, augmented and virtual reality and consumer electronics has led to the generation of huge amounts of data: nearly all daily activities are now recorded, including health records, sales records and social conversations. The generation of this data has also led to an intersection of the statistical and scientific fields. At the same time, the availability of huge amounts of data has created a problem, because the traditional approaches to data extraction and analysis have become obsolete.

The objective of data mining is twofold: the generation of descriptive models and the generation of predictive models. Descriptive data mining involves the analysis and identification of the general properties of the data present in a given data set, using methods such as sequence analysis, association rules, clustering and summarization. Predictive data mining, on the other hand, involves making inferences from existing data in order to make predictions, using methods such as regression analysis, classification, prediction and time series analysis. The proposed study focuses on predictive data mining techniques. The first chapter introduces the topic of big data and the data lifecycle, as well as the objectives of the study. The second chapter presents a review of the literature to identify research gaps. Finally, the third chapter presents the methodology that will be adopted in the study.

Background of the Study

In the modern business environment, data can take different forms, including structured and unstructured data such as multimedia files, genetic mapping, text files and financial data (EMC Education Services, 2015). Unlike conventional data, modern data are mostly semi-structured or unstructured and hence require different tools and techniques to extract and analyze the required information. According to EMC Education Services (2015), the data analytics lifecycle comprises six phases: discovery, data preparation, model planning, model building, communication of the results and operationalization. Discovery involves examining the relevant history to assess the resources available for the project, including data, time, people and technology.

Once the assessment has been done, the data preparation phase follows, in which the project team works with the data and performs the analysis for the duration of the project. Data preparation involves extracting, transforming and loading the data, sometimes with further in-database transformation (often shortened as ETLT). After preparing the data, the next step is model planning, whereby the techniques, methods and workflows that are expected to be used are identified. When the model has been properly planned, the next phase is model building, whereby the project team identifies data sets for training, testing and production purposes. The outcomes of the model are then communicated to the relevant stakeholders for analysis before the technical reports and documents are compiled. The data analytics lifecycle is important in data mining because it offers guidance on the expectations of each phase.
Motivation

If the data generated from various sources are effectively analyzed, interpreted and integrated, they can offer key insights that can be used to mitigate societal issues in areas such as employment, economics and health (Li, Nsofor & Song, 2009). As such, it is important to identify the data mining techniques that best fit current analytics requirements. The proposed study aims to identify the best data mining techniques for the current wave of data.

Predictive data mining techniques are the most developed techniques and the most important for decision making. The choice of technique depends on various factors, including the competence of the analyst as well as the techniques available. During the data mining process, massive resources and time are wasted assessing each prediction technique against the needs of a given data set.
Justification of the Problem

With the advancement in data prediction techniques, it is important to identify the techniques that are best suited for a given data set. This matters to data scientists and analysts because substantial resources are otherwise required to test each technique to identify the one that suits a given data set. As such, it is important to develop a methodology for testing data mining techniques, and to test that methodology using key data sets in the United States. Other analysts and scientists can then use this methodology to identify the best technique for their own data sets.
Deliverables

This study entails an experimental design to test the best technique for data mining. In the proposed study, key prediction tools including regression, time series analysis and classification are compared to identify the best technique for the given data sets. The regression techniques considered in the comparison are Partial Least Squares (PLS), Principal Component Regression (PCR) and Multiple Linear Regression (MLR). The classification models considered in the comparison are neural networks, decision trees and Bayesian classification. The time series techniques to be considered are Seasonal Auto-Regressive Integrated Moving Average (SARIMA), Hidden Markov models and Self-Organizing Maps (SOM).

Best Data Mining Technique

Figure 1: The proposed study

Research Question

The research question guiding the proposed study is: what is the best prediction technique for data mining?

Significance of the study

The outcome of this research is expected to be knowledge of which data mining technique works best for a particular predictive analytics application and why, and of how to tune its parameters to derive reasonably accurate and satisfactory insights. This knowledge is intended to help:

Small businesses that cannot afford proprietary, expensive vendor-offered analytics solutions or building a brand-new in-house analytics framework.
Individuals that want to use free tools and data to make important decisions, such as the right time or location to purchase a home, vehicle or travel tickets.

Scope of the study

I do not plan to include data mining approaches that rely heavily on machine learning, since any such approach can be improved to a greater or lesser degree by repeating the learning algorithm across several data sets and adjusting variables. Hence, their scope cannot be defined well enough to compare apples to apples with the other mining techniques.

The conclusions of the study will only be reasonably consistent if the following conditions are fulfilled:

All the studied techniques can be tested with a single tool on a single data set.
The tool provides the capability to customize the parameters studied as part of the research.
The data set meets the quality standards required by the techniques studied (data density, completeness, accuracy requirements, etc. vary by technique).

Hence, depending on the above feasibility conditions, the study scope may have to be qualitatively or quantitatively broadened, narrowed or deepened. Otherwise, the results of this research could be relative, skewed and incomplete for not being able to test all parameters.

Predictive data mining models are mainly used for making predictions about the future based on current behavior. Various techniques have continually been proposed for data mining, and this abundance has created a persistent selection problem. Predictive models are supervised learning functions that predict a target value. This section outlines the prediction techniques that will be used in the proposed study, along with the advantages and disadvantages of each prediction model. The techniques to be explored are presented in the chart below.

Classification techniques

The classification model is the best-understood predictive data mining technique (Rokach & Maimon, 2014). Classification techniques have wide applications, including business modelling, customer segmentation and credit analysis. In a classification model, historical data are used to build a model that can accurately predict future behavior. Various classification algorithms can be used to identify relations between predictor values (historical data) and target values. The classification models covered in this paper are neural networks, decision trees and Bayesian classification.

Decision Trees

A decision tree is a structure that starts from a root node testing a root attribute and terminates in leaf nodes. Decision trees often contain multiple attributes. The generation of the decision tree is driven by the information-gain measure.

Information gain is an impurity-based measure that uses entropy as the impurity measure (Rokach & Maimon, 2014). The information gained by splitting on attribute ai is given by

I(ai, S) = Entropy(y, S) − Σ_{v ∈ dom(ai)} (|Sv| / |S|) · Entropy(y, Sv)

where

Entropy(y, S) = − Σ_{c ∈ dom(y)} (|Sc| / |S|) · log2(|Sc| / |S|)

Here S is the training data set, y is the target attribute and ai is a candidate input attribute; I(ai, S) measures the reduction in the impurity of the target attribute obtained by splitting S on ai (Zhang et al., 2009).

Decision tree algorithms describe attribute relationships as well as the relative importance of those attributes. Moreover, human-readable rules can be extracted from decision trees, and the classification and learning steps of decision tree induction are generally fast (Danjuma & Osofisan, 2015). However, decision trees are not well suited to complex and large data sets because of the computing power and classification iterations required.
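To make the entropy and information-gain formulas above concrete, here is a minimal Python sketch; the toy weather data, the function names and the tuple-based sample representation are hypothetical illustrations, not part of the proposed study.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(y, S): impurity of the target attribute over a sample S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """I(ai, S): reduction in target impurity from splitting S on attribute ai."""
    n = len(labels)
    total = entropy(labels)
    # Partition the sample by the value of the candidate attribute.
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    weighted = sum(len(p) / n * entropy(p) for p in parts.values())
    return total - weighted

# Hypothetical toy data: attribute 0 perfectly separates the two classes.
rows = [("sunny",), ("sunny",), ("rainy",), ("rainy",)]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # 1.0 (one full bit of information)
```

A tree-induction algorithm such as ID3 would call `information_gain` for every candidate attribute and split on the one with the highest value, recursing until the leaves are pure.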

Bayesian Classifiers

Bayesian classifiers are statistical classifiers that predict class membership probabilities. Various Bayesian algorithms have been developed based on the naïve Bayes and Bayesian network methods. The naïve Bayes algorithm assumes that the effect of an attribute on a given class is independent of the values of the other attributes. In practice, however, attributes often have some dependency, which limits the naïve Bayes algorithm. Bayesian networks are graphical models that can describe joint conditional probability distributions. For instance, consider a sample X whose class is to be determined, and let H be the hypothesis that X belongs to a given class C. The main objective in Bayesian classification is to determine the probability that H holds for a given sample X, P(H|X), as shown below.

P(H|X) = P(X|H) · P(H) / P(X)

where P(H|X) is the posterior probability, P(H) is the prior probability and P(X|H) is the likelihood of observing X given H. In this formula, P(X), P(X|H) and P(H) can be estimated from historical data, and hence the hypothesis can be tested (Zhang et al., 2009).

The key strength of Bayesian classifiers is their high accuracy and speed when applied to large data sets. Moreover, Bayesian networks allow analysts to incorporate their expert knowledge rather than relying only on coded data. Bayesian networks are also superior to decision trees because they can capture interactions among input variables. However, Bayesian classifiers are complex and hence require considerable competence to apply (Danjuma & Osofisan, 2015).
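The Bayes rule above can be sketched as a minimal categorical naïve Bayes classifier in Python; the fruit data, class names and the particular Laplace-style smoothing used here are hypothetical illustrations, not the study's implementation.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal categorical naive Bayes: P(H|X) is proportional to
    P(H) * product of P(xj|H) under the attribute-independence assumption."""

    def fit(self, rows, labels):
        self.priors = Counter(labels)          # class counts for P(H)
        self.n = len(labels)
        self.cond = defaultdict(Counter)       # (class, attr) -> value counts
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.cond[(y, j)][v] += 1
        return self

    def predict(self, row):
        best, best_p = None, -1.0
        for y, cy in self.priors.items():
            p = cy / self.n                    # prior P(H)
            for j, v in enumerate(row):
                # Laplace-style smoothing so unseen values do not zero the product.
                p *= (self.cond[(y, j)][v] + 1) / (cy + len(self.cond[(y, j)]) + 1)
            if p > best_p:
                best, best_p = y, p
        return best

# Hypothetical toy data: color and shape predict the fruit class.
rows = [("red", "round"), ("red", "round"), ("green", "long"), ("green", "long")]
labels = ["apple", "apple", "banana", "banana"]
print(NaiveBayes().fit(rows, labels).predict(("red", "round")))  # apple
```

Since P(X) is the same for every class, the sketch compares only the numerators P(X|H)·P(H), which is all that is needed to pick the most probable class.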

Neural Networks

Neural networks are composed of connected input/output units, and each connection has its own weight. A neural network comprises three kinds of layers: the input layer, hidden layers and the output layer (Zhang et al., 2009). Unlike decision trees, neural networks have multiple input nodes, each associated with a given attribute. During the learning process, the network adjusts the weights on its connections so that the input-output relations are satisfied (Tapak, Mahjub, Hamidi & Poorolajal, 2013). Neural networks are very popular for classification and prediction because of advantages such as high noise tolerance and the capability to classify and predict previously unseen patterns (Danjuma & Osofisan, 2015).

Regression Techniques

Regression is another key predictive data mining model, in which the dependency between attributes is analyzed. The key difference between regression and classification techniques is that classification deals with categorical or discrete targets while regression deals with continuous or numerical target attributes. Linear regression models are the most common tools used in data mining; they involve fitting the line that minimizes the average squared distance between the line and the data points. The most common linear regression models are Partial Least Squares (PLS), Principal Component Regression (PCR) and Multiple Linear Regression (MLR).

Partial Least Squares (PLS)

PLS is a key prediction technique in which the input data (X) and output data (Y) are transformed into new scores t and u respectively. After the transformation, a linear mapping is established between the new input scores (t) and the new output scores (u). Additional analysis of the new input and output scores establishes the loading matrices P and Q. The key strength of PLS is that it captures the maximum covariance using the minimum number of variables.
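As an illustration of the transformation just described, the following sketch extracts the first PLS component for a single output (the PLS1 case) using one NIPALS-style step; the centered toy data and function name are hypothetical, and a full PLS implementation would deflate X and repeat for further components.

```python
def pls1_first_component(X, y):
    """One NIPALS-style step of PLS1: project the inputs onto the direction
    that maximizes covariance with the output, then regress on that score."""
    n, p = len(X), len(X[0])
    # Weight vector w proportional to X^T y (the covariance direction), normalized.
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores t = X w: the new input variable described in the text.
    t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Inner relation: regress y on t to link input and output scores.
    b = sum(ti * yi for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
    return w, t, b

# Hypothetical centered data where y depends only on the first input.
X = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
y = [1.0, 2.0, -1.0, -2.0]
w, t, b = pls1_first_component(X, y)
print(w, round(b, 6))  # weight points along the first input; b is 1.0
```

Because the second input carries no covariance with y, the weight vector ignores it entirely, showing how PLS captures the maximum covariance with a minimum number of latent variables.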

Multiple Linear Regression

In multiple linear regression models, the coefficient of determination expresses the percentage of the total variation in the outcome variable that is explained by the independent variables. Correctly interpreted, it indicates whether the fitted model is substantial enough to draw inferences from. The coefficient of determination (R2) ranges from zero to one and can be calculated as follows:

R2 = 1 − SSres / SStot = 1 − Σ (yi − ŷi)^2 / Σ (yi − ȳ)^2

The multiple regression model itself is represented by

ŷ = a0 + a1x1 + a2x2 + a3x3 + … + anxn

Although multiple linear regression is effective in prediction, it has its flaws. For instance, the estimates of the regression coefficients can be unstable because of multicollinearity in models with many variables. Moreover, including independent variables that are not correlated with the dependent variable can increase the variance of the predictions.
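The R2 computation above can be sketched in Python; for brevity the fit shown is the single-predictor special case of ŷ = a0 + a1x1 + … + anxn, and the data set is a hypothetical toy sample.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSres / SStot."""
    mean = sum(y) / len(y)
    ss_tot = sum((yi - mean) ** 2 for yi in y)          # total variation
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained part
    return 1 - ss_res / ss_tot

def fit_line(x, y):
    """Closed-form least-squares fit y_hat = a0 + a1 * x
    (the single-predictor special case of multiple linear regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    a0 = my - a1 * mx
    return a0, a1

# Hypothetical data lying close to y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
a0, a1 = fit_line(x, y)
y_hat = [a0 + a1 * xi for xi in x]
print(round(r_squared(y, y_hat), 3))  # close to 1: the line explains most variation
```

An R2 near one, as here, indicates that the fitted model accounts for almost all of the variation in the outcome variable, so inferences drawn from it rest on solid ground.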

Principal Component Regression (PCR)

PCR is the most common dimension-reduction method; it uses linear projections to identify the underlying variance of the data. PCR incorporates three stages: establishment of the principal components, identification of the principal components relevant to the prediction model, and multiple linear regression on those components. PCR can be viewed as a special case of the Singular Value Decomposition (SVD) algorithm. The formulas for decomposing the inputs and transforming the principal components are shown below.

X = U*S*V^T (decomposition of the inputs)

X = z1*v1^T + z2*v2^T + … + zn*vn^T + E (transformation into principal components)

In the selection of the principal components, the Best Subset Selection (BSS) approach is often used because it considers both the input and output variables. Although PCR is a key technique in data mining, it has the limitation that the component-extraction step considers only the input variables.
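The three PCR stages above can be sketched with NumPy's SVD; the random data, the `pcr_fit` function name and the simple keep-the-first-k component selection (rather than Best Subset Selection) are hypothetical choices for illustration.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal Component Regression: decompose X = U S V^T, keep the first
    k components Z = X V_k, then ordinary least squares of y on Z."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # decomposition of inputs
    Vk = Vt[:k].T                                     # loadings of the k components
    Z = X @ Vk                                        # component scores z1..zk
    coef_z, *_ = np.linalg.lstsq(Z, y, rcond=None)    # regression on the scores
    return Vk @ coef_z                                # coefficients in input space

# Hypothetical noise-free data: y depends linearly on the inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.0])
y = X @ true_beta
beta_full = pcr_fit(X, y, k=3)   # keeping all components reduces to ordinary LS
print(np.round(beta_full, 6))    # recovers [2, -1, 0] up to floating point
```

With k smaller than the number of inputs, the regression is performed in a lower-dimensional score space, which is exactly how PCR stabilizes coefficient estimates when the inputs are collinear.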

Time Series (TS) Techniques

Time-series analysis is a prediction technique for data with one or more time-dependent attributes; it involves predicting numeric outcomes such as future stock prices. Time series analysis makes it possible to visualize the structure of data and identify trends over a period of time. The most common models for TS analysis are Seasonal Auto-Regressive Integrated Moving Average (SARIMA), Hidden Markov models and Self-Organizing Maps (SOM) (Fu, 2011; Esling & Agon, 2012).
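To give a flavor of the ARIMA family named above, the sketch below implements only its two simplest building blocks, a moving-average forecast and a least-squares AR(1) fit; it is not a SARIMA implementation (which adds differencing and seasonal terms), and the synthetic series is hypothetical.

```python
def moving_average_forecast(series, window):
    """Naive forecast: predict the next value as the mean of the last
    `window` observations (the MA building block of ARIMA-style models)."""
    return sum(series[-window:]) / window

def ar1_fit(series):
    """Fit x_t = c + phi * x_{t-1} by least squares (the AR building block)."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    phi = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / \
          sum((a - mx) ** 2 for a in xs)
    return my - phi * mx, phi  # intercept c, coefficient phi

# Hypothetical series generated exactly by x_t = 1 + 0.5 * x_{t-1}.
series = [0.0]
for _ in range(10):
    series.append(1 + 0.5 * series[-1])
c, phi = ar1_fit(series)
print(round(c, 6), round(phi, 6))  # 1.0 0.5
```

Because the toy series follows the AR(1) recursion exactly, least squares recovers the generating parameters; on real data the fit would only approximate them, and seasonal or integrated terms would be added as SARIMA prescribes.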

Expected Outcome: The main objective of the proposed study is to identify the best predictive data mining model for a given data set.

How the results will be realized: To realize this objective, various predictive data mining techniques including regression (MLR, PCR and PLS), time series (SARIMA, Hidden Markov and SOM) and classification (neural networks, Bayesian classifiers and decision trees) will be compared. The following measures will be used in the analysis:

Accuracy, sensitivity and specificity (ASS): these are common statistical measures used to estimate the efficacy and consistency of the techniques. Related measures such as the false positive rate and the false negative rate are also used to estimate the efficiency of data mining techniques (Danjuma & Osofisan, 2015).

Receiver Operating Characteristics (ROC): the ROC curve plots the true positive rate against the false positive rate and can be used to measure the performance of data mining algorithms (Danjuma & Osofisan, 2015).

Mean Absolute Error (MAE): this is the average of the absolute values of the prediction errors. MAE can be calculated as follows:

MAE = (1/n) · Σ |yi − ŷi|

Kappa statistics: this parameter measures the agreement between two raters. It can be used to measure the agreement between the predicted values and the observed values (Pittman, 2008).
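The comparison measures listed above can be computed directly from predictions, as in this minimal Python sketch; the label sequences are hypothetical, and the study itself would obtain these values from the open source tools named later.

```python
def classification_metrics(actual, predicted, positive="yes"):
    """Accuracy, sensitivity and specificity from a binary confusion matrix."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return {
        "accuracy": (tp + tn) / len(actual),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

def mean_absolute_error(y, y_hat):
    """MAE = (1/n) * sum of |yi - yi_hat|."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def cohen_kappa(actual, predicted):
    """Agreement between two raters corrected for chance agreement."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    labels = set(actual) | set(predicted)
    expected = sum((actual.count(l) / n) * (predicted.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]
print(classification_metrics(actual, predicted))
print(mean_absolute_error([3.0, 5.0], [2.5, 5.5]))  # 0.5
```

The ROC curve is not reproduced here because it requires sweeping a decision threshold over predicted probabilities; the point-estimate metrics above are the ones that can be read off a single confusion matrix.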

The outcome of this study will be very important for data scientists and analysts because they will be able to identify the best technique for data mining. Moreover, the proposed methodology can be applied to other data sets to identify the best technique for a particular group of data sets, for example, health care records.

Procedure

Data Preparation

Once the data sets to be used in the analysis have been identified, the first step is to prepare the data for mining. Data preparation is an important phase that structures the data so that it fits the proposed comparison techniques. It involves data cleaning, data integration, data selection and data transformation. In the proposed study, a relationship check (data cleaning) will be carried out by comparing the inputs and outputs of the data sets before integration. This step aims to remove noise and address missing values; the missing values are filled rather than eliminated because eliminating them would reduce the amount of data available for creating the model. Noise is addressed using smoothing techniques, while missing values are filled with the mean value. Examples of smoothing techniques include combined computer and human inspection, clustering, binning and regression. For the proposed study, regression techniques will be used to reduce the noise effects.

Once the relationship check has been done, data transformation will follow. Data transformation involves standardization or scaling of the data sets, which aims to reduce the dispersion level of the variables. Normalization will also be carried out, whereby the attributes are scaled to fit the required range. Moreover, correlation analysis will be used to identify relationships between the attributes. The outcome of data transformation is structured data that can be used to develop the testing models. During the model development phase, data mining tools are used to analyze the data and identify relationships and patterns or generate rules. However, the patterns identified by data mining tools are not always useful, and thus data experts are required to identify the useful patterns that conform to the objectives of the study.
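Two of the preparation steps just described, mean imputation of missing values and scaling attributes into a required range, can be sketched as follows; the tiny value list and the use of None for missing entries are hypothetical illustrations.

```python
def fill_missing_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values,
    so no records need to be eliminated from the data set."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values, lo=0.0, hi=1.0):
    """Scale an attribute into the required [lo, hi] range to reduce dispersion."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

raw = [10.0, None, 30.0, 20.0]
filled = fill_missing_with_mean(raw)  # the None becomes 20.0, the observed mean
print(min_max_normalize(filled))      # [0.0, 0.5, 1.0, 0.5]
```

In practice each attribute column of the data set would be passed through this pipeline before model building, so that every technique under comparison receives data on the same scale.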

Data Mining

After data preparation, the second step is data mining, which involves the extraction of potentially useful patterns using data mining tools. For this step, each data set will be split into two parts: a training set and a test validation set. The training set will be used to build the models, while the test validation set will be used to assess the models developed. The assessment is essential to ensure that the models created are valid and reliable enough to meet the objectives of the study. The outcome on the training data will not be recorded because it is not relevant to the proposed study. Once the models have been developed and assessed, Kappa statistics, accuracy, sensitivity and specificity (ASS), Receiver Operating Characteristics (ROC) and the Mean Absolute Error (MAE) will be used to compare the prediction power of the techniques. To conduct the comparisons, various open source tools including RapidMiner, KNIME, Weka and Orange will be utilized. After mining the data, visual techniques will be used to present the results to the users. This step is important because it enables the users to interpret and understand the data.
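The split into a training set and a test validation set can be sketched in a few lines; the ten-row toy data, the 70/30 split fraction and the fixed seed are hypothetical choices (the study's tools, such as Weka, provide their own split facilities).

```python
import random

def train_test_split(rows, labels, test_fraction=0.3, seed=42):
    """Partition a data set into a training part (for model building)
    and a held-out test validation part (for model assessment)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)        # shuffle to avoid ordering bias
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

# Hypothetical ten-record data set with alternating class labels.
rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
X_tr, y_tr, X_te, y_te = train_test_split(rows, labels)
print(len(X_tr), len(X_te))  # 7 3
```

Fixing the seed makes the partition reproducible, which matters when several techniques must be compared on exactly the same training and test records.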

The proposed study is expected to be carried out over eight weeks and will be completed before the end of summer. The outline of each week is shown below:

The first week will be used for the selection of the key elements of the study, including applications, techniques, tools and data sets.
After the selection of the study elements, the second week will be used to study and report on various predictive analytics applications that use data mining techniques. This step will enable the identification of applicable techniques.
The third week will be used to study and report on data mining processes as well as the major stages of those processes.
The fourth week will be used to study and report on popular predictive data mining approaches and their respective impacts on the outcome. The pros and cons of the approaches will be assessed during this period, and the efficacy of the techniques will be assessed using qualitative methods.
Once the predictive data mining techniques have been identified, the next step is to identify appropriate data sets. As such, the fifth week will be used to compare and identify the data sets to which the techniques identified in the fourth week can be applied.
The sixth week will be used to identify pattern parameters for data mining, such as classification, association, etc.
Once the pattern parameters have been identified, the seventh week will be used to identify appropriate open source data mining tools that can be used to process and compare the identified techniques.
The final week will be used to carry out the study and compare the results using visualization techniques. This process is summarized in the table below.

Week 1: Selection of the key elements of the study (applications, techniques, tools and data sets)
Week 2: Study and report on various predictive analytics applications
Week 3: Study and report on data mining processes as well as the major stages of the processes
Week 4: Study and report on popular predictive data mining approaches and their respective impact on the outcome
Week 5: Compare and identify the right data sets
Week 6: Identify pattern parameters for data mining
Week 7: Identify appropriate open source data mining tools
Week 8: Carry out the study and compare the results

References

Danjuma, K., & Osofisan, A. O. (2015). Evaluation of Predictive Data Mining Algorithms in Erythemato-Squamous Disease Diagnosis. arXiv preprint arXiv:1501.00607.

EMC Education Services. (2015). Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Wiley.

Esling, P., & Agon, C. (2012). Time-series data mining. ACM Computing Surveys (CSUR), 45(1), 12.

Fu, T. C. (2011). A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1), 164-181.

Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques. Elsevier.

Li, X., Nsofor, G. C., & Song, L. (2009). A comparative analysis of predictive data mining techniques. International Journal of Rapid Manufacturing, 1(2), 150-172.

Pittman, K. (2008). Comparison of data mining techniques used to predict student retention. ProQuest.

Rokach, L., & Maimon, O. (2014). Data mining with decision trees: Theory and applications. World Scientific.

Sondwale, P. P. (2015). Overview of Predictive and Descriptive Data Mining Techniques. International Journal of Advanced Research in Computer Science and Software Engineering, 5(4), 262-265.

Tapak, L., Mahjub, H., Hamidi, O., & Poorolajal, J. (2013). Real-data comparison of data mining methods in prediction of diabetes in Iran. Healthcare Informatics Research, 19(3), 177-185.

Zhang, S., Tjortjis, C., Zeng, X., Qiao, H., Buchan, I., & Keane, J. (2009). Comparing data mining methods with logistic regression in childhood obesity prediction. Information Systems Frontiers, 11(4), 449-460.
