This step is the learning step or the learning phase.
In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and their associated class labels.
Each tuple that constitutes the training set is referred to as a category or class. These tuples can also be referred to as sample, object or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of classification rules. The classification rules can be applied to the new data tuples if the accuracy is considered acceptable.
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities:
- Data Cleaning − data cleaning involves removing the noise and treatment of missing values. The noise is removed by applying smoothing techniques and the problem of missing values is solved by replacing a missing value with most commonly occurring value for that attribute;
- Relevance Analysis − database may also have the irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
Data Transformation and reduction
The data can be transformed by any of the following methods.
Normalization − the data is transformed using normalization. Normalization involves scaling all values for given attribute in order to make them fall within a small specified range. Normalization is used when in the learning step, the neural networks or the methods involving measurements are used.
Generalization − the data can also be transformed by generalizing it to the higher concept. For this purpose we can use the concept hierarchies.
Comparison of Classification and Prediction Methods
Here is the criteria for comparing the methods of Classification and Prediction:
- Accuracy − accuracy of classifier refers to the ability of classifier. It predict the class label correctly and the accuracy of the predictor refers to how well a given predictor can guess the value of predicted attribute for a new data;
- Speed − this refers to the computational cost in generating and using the classifier or predictor;
Comparison of Classification and Prediction Methods
Here is the criteria for comparing the methods of Classification and Prediction:
- Robustness − it refers to the ability of classifier or predictor to make correct predictions from given noisy data;
- Scalability − scalability refers to the ability to construct the classifier or predictor efficiently; given large amount of data;
- Interpretability − it refers to what extent the classifier or predictor understands.
Decision trees
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node (fig. 6.2).
The following decision tree is for the concept buy_computer that indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
Figure 6.2 - Decision trees
The benefits of having a decision tree are as follows:
- it does not require any domain knowledge;
- it is easy to comprehend;
- the learning and classification steps of a decision tree are simple and fast.
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.
Here is the Tree Pruning Approaches listed below:
- Pre-pruning − the tree is pruned by halting its construction early;
- Post-pruning - this approach removes a sub-tree from a fully grown tree.
The cost complexity is measured by the following two parameters:
- number of leaves in the tree;
- error rate of the tree.
Дата: 2019-02-02, просмотров: 371.