There are several algorithms available for data mining. Among them, some well-known and widely used algorithms are discussed below. These algorithms are widely used in the IEEE community and were presented at the IEEE International Conference on Data Mining (ICDM, Hong Kong, 2006)~\cite{wu2008top} as top algorithms for data mining. They are associated with classification, clustering, statistical decision making, etc.

\subsection{Na\"ive Bayes}
Na\"ive Bayes methods are a set of supervised learning methods and simple probabilistic classifiers~\cite{scikit-learn} based on Bayes' theorem with a strong na\"ive (independence) assumption between every pair of features~\cite{Wiki_naive}. For a given class variable $y$ and a feature vector $X = (x_1, x_2, \dots, x_n)$, Bayes' theorem gives the following relation:
\begin{equation*}
P\left(y \mid x_1, \dots, x_n\right) = \frac{P(y)\, P\left(x_1, \dots, x_n \mid y\right)}{P(x_1, \dots, x_n)}
\end{equation*}
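As a quick numeric sanity check of Bayes' theorem, the sketch below computes a posterior in plain Python. The class names and all probability values are invented toy numbers, used only to illustrate the formula:

```python
# Toy illustration of Bayes' theorem: P(y|x) = P(y) * P(x|y) / P(x).
# The evidence P(x) is obtained by summing over all classes.
# All numbers below are made up for illustration only.

prior = {"spam": 0.3, "ham": 0.7}        # assumed P(y)
likelihood = {"spam": 0.8, "ham": 0.1}   # assumed P(x|y) for one observed feature x

evidence = sum(prior[y] * likelihood[y] for y in prior)                # P(x)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}    # P(y|x)

print(posterior)  # the posterior probabilities sum to 1
```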

If the na\"ive (independence) assumption, that is
$$P\left(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n\right) = P\left(x_i \mid y\right),$$
is applied for all $i$, the relation can be simplified as below:
\begin{equation*}
P\left(y \mid x_1, \dots, x_n\right) = \frac{P(y) \prod_{i=1}^{n} P\left(x_i \mid y\right)}{P(x_1, \dots, x_n)}
\end{equation*}

As the denominator $P(x_1, \dots, x_n)$ is constant for a given input, the relation above can be used as a simple classification rule:
$$P\left(y \mid x_1, \dots, x_n\right) \propto P(y) \prod_{i=1}^{n} P\left(x_i \mid y\right)$$
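The classification rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this work: the two classes, the two categorical features, and every probability table entry below are hypothetical hand-set values.

```python
# Minimal naive Bayes decision rule: y_hat = argmax_y  P(y) * prod_i P(x_i | y).
# All probabilities are made-up toy values for illustration only.
from math import prod

priors = {"A": 0.6, "B": 0.4}   # P(y)
cond = {                        # P(x_i = value | y), one table per feature index
    "A": [{"hot": 0.7, "cold": 0.3}, {"dry": 0.2, "wet": 0.8}],
    "B": [{"hot": 0.4, "cold": 0.6}, {"dry": 0.9, "wet": 0.1}],
}

def classify(x):
    # unnormalized posterior score per class; the evidence term is dropped
    scores = {y: priors[y] * prod(table[value] for table, value in zip(cond[y], x))
              for y in priors}
    return max(scores, key=scores.get)

print(classify(["hot", "wet"]))   # prints "A" (0.6*0.7*0.8 beats 0.4*0.4*0.1)
```

Note that the evidence $P(x_1, \dots, x_n)$ never needs to be computed: it scales every class score equally, so the argmax is unchanged.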

Now maximum a posteriori (MAP) estimation can be used to estimate the prior probability $P(y)$ and the class-conditional probability $P\left(x_i \mid y\right)$. The different na\"ive Bayes classifiers differ mainly in the assumptions they make about the distribution of the class-conditional probability $P\left(x_i \mid y\right)$. Na\"ive Bayes is faster than many more complex classification algorithms because of its simplicity. In spite of its simplistic assumption, na\"ive Bayes works well on many real-world problems: the classifier can classify correctly even when its probability estimates are inaccurate~\cite{rish2001empirical}. Many improvements on na\"ive Bayes have been introduced.

\subsection{Decision Tree}
A decision tree is a machine learning approach that uses a tree structure to observe an item and make decisions about it. Each internal node denotes an observation (test) on an attribute, each branch denotes the outcome (decision) of a test, and each leaf node holds a class label~\cite{Decision}. The topmost node in the tree is the root node. There are multiple algorithms for building a decision tree. The C4.5 algorithm is an extension of Ross Quinlan's ID3 algorithm~\cite{Salzberg1994}. C4.5 generates a decision tree that can be used as a classifier. It builds the tree from a set of training data using information entropy, following a divide-and-conquer strategy~\cite{quinlan1996improved}. C4.5 splits the data by choosing the most appropriate feature at each node. The split criterion is the normalized information gain: the attribute with the highest normalized information gain is chosen. It uses the Shannon entropy to calculate the information gain:
\begin{equation*}
E(s) = -\sum_{i} p_i \log_2 p_i
\end{equation*}

\subsection{Artificial Neural Network -- Perceptron}
A Multi-Layer Perceptron (MLP) is a network consisting of artificial neurons called perceptrons~\cite{MPL}. A perceptron generates a single output from multiple real-valued inputs: it forms a linear combination using its input weights and then passes the result through some non-linear activation function~\cite{MPL}. The MLP is a supervised learning algorithm that learns a function $f(\cdot): R^m \rightarrow R^o$ by training on a dataset, where $m$ is the number of input dimensions and $o$ is the number of output dimensions. Given a set of features $X = \{x_1, x_2, \dots, x_m\}$ and a target $y$, it can learn a non-linear estimator for either classification or regression. It differs from linear regression in that there can be one or more layers between the input and output layers, called hidden layers~\cite{sup}. The Single-Layer Perceptron (SLP) has no a priori knowledge, so the initial weights are assigned randomly. The SLP sums all the weighted inputs, and if the sum is above a predetermined threshold $\theta$, the SLP is said to be activated (output $= 1$):
$$w_1x_1 + w_2x_2 + \dots + w_nx_n > \theta \;\rightarrow\; \text{output}_1$$
$$w_1x_1 + w_2x_2 + \dots + w_nx_n \leq \theta \;\rightarrow\; \text{output}_2$$
The input values are presented to the perceptron, and if the predicted output is the same as the desired output, the performance is considered satisfactory and no changes to the weights are made. However, if the output does not match the desired output, the weights are changed to reduce the error.

\subsection{Support Vector Machine}
The Support Vector Machine (SVM) is a very widely used algorithm, as in some particular cases it is very robust and accurate~\cite{vapnik1995nature}. SVM can be used for both classification and regression, but it is mostly used for classification~\cite{SVM}. Each data item is plotted as a point in $n$-dimensional space (where $n$ is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyper-plane that best separates the two classes (figure~\ref{SVM_1}, \cite{SVM}).
\begin{figure}[htp]
\centering
\includegraphics[width=15cm,height=8cm]{figures/SVM_1.png}
\caption{Hyper-plane of a Support Vector Machine}
\label{SVM_1}
\end{figure}
The coordinates of the points nearest the boundary are called support vectors, and the separating hyper-plane gives the method its name. In practice, SVM is implemented using different kernels (figure~\ref{SVM_kernel}).
\begin{figure}[htp]
\centering
\includegraphics[width=15cm,height=10cm]{figures/sphx_glr_plot_iris_001.png}
\caption{SVM using Different Kernels}
\label{SVM_kernel}
\end{figure}
An implementation of SVM is shown in the Appendix.

\subsection{K-Nearest Neighbour}
The k-nearest neighbour technique is another widely used algorithm for both classification and regression, though it is mostly used for classification and clustering. The output of a k-NN model depends on whether it is used for classification or regression~\cite{Knn}. The principle behind nearest-neighbour methods is to find a predefined number of training samples closest in distance to the new point and to predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbour learning) or can vary based on the local density of points (radius-based neighbour learning). The distance can, in general, be any metric measure; the standard Euclidean distance is the most common choice. Neighbour-based methods are known as non-generalizing machine learning methods, since they simply ``remember'' all of the training data (possibly transformed into a fast indexing structure such as a Ball Tree or a KD Tree). Despite its simplicity, nearest neighbours has been successful in a large number of classification and regression problems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.

Let us take a simple case to understand this algorithm. Figure~\ref{KNN} shows a spread of red circles (RC) and green squares (GS).
\begin{figure}[H]
\centering
\includegraphics[width=15cm]{figures/scenario1.png}
\caption{K-nearest Neighbour Classifier Scenario 1}
\label{KNN}
\end{figure}
We want to classify the blue star (BS) in figure~\ref{KNN}. BS can be either RC or GS and nothing else. The ``K'' in the KNN algorithm is the number of nearest neighbours we wish to take a vote from. Let us say $K = 3$. Hence, we now draw a circle with BS as its centre, just big enough to enclose only three data points on the plane. Refer to figure~\ref{KNN2} for more details.
\begin{figure}[H]
\centering
\includegraphics[width=15cm]{figures/scenario2.png}
\caption{K-nearest Neighbour Classifier Scenario 2}
\label{KNN2}
\end{figure}
The three points closest to BS are all RC. Hence, with a good confidence level, we can say that BS should belong to the class RC. Here the choice is very obvious, as all three votes from the closest neighbours went to RC. The choice of the parameter $K$ is very crucial in this algorithm; next, the factors to be considered in choosing the best $K$ will be discussed.

\subsection{Fuzzy Modeling}
Fuzzy logic is a special form of many-valued logic designed to handle the concept of partial truth; that is, a truth value can vary between completely true and completely false~\cite{novak1999j}. Unlike binary logic, a fuzzy model can take any truth value between 0 and 1. Fuzzy rule bases can represent both classification and regression functions, and different types of fuzzy models have been used for these purposes. In order to realize a regression function, a fuzzy system is usually wrapped in a ``fuzzifier'' and a ``defuzzifier'': the former maps a crisp input to a fuzzy one, which is then processed by the fuzzy system, and the latter maps the (fuzzy) output of the fuzzy system back to a crisp value. For the so-called Takagi--Sugeno models, which are quite popular for modeling regression functions, the defuzzification step is unnecessary, since these models output crisp values directly. An alternative is to proceed from a fixed fuzzy partition for each attribute, i.e., a regular ``fuzzy grid'' of the input space, and to consider each cell of this grid as a potential antecedent part of a rule~\cite{wang1992generating}. This approach is advantageous from an interpretability point of view. On the other hand, it is less flexible and may produce inaccurate models when the one-dimensional partitions define a multi-dimensional grid that does not reflect the structure of the data. Fuzzy modeling is well suited to non-linear datasets such as medical data.
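The idea of partial truth can be illustrated with a tiny sketch in plain Python. The linguistic term ``fever'', its temperature breakpoints (36.5 and 39.0 degrees Celsius), and the choice of min/max as connectives are hypothetical illustrative choices, not part of any model used in this work:

```python
# Toy fuzzy membership function for the (hypothetical) linguistic term "fever":
# truth grows linearly from 0 at 36.5 C to 1 at 39.0 C.

def fever_membership(temp_c):
    if temp_c <= 36.5:
        return 0.0            # completely false
    if temp_c >= 39.0:
        return 1.0            # completely true
    return (temp_c - 36.5) / (39.0 - 36.5)   # partial truth in (0, 1)

# Common fuzzy connectives on truth values in [0, 1]:
fuzzy_and = min   # a standard t-norm
fuzzy_or = max    # the matching t-conorm

print(fever_membership(37.75))   # 0.5, half-way between the breakpoints
```

A crisp (binary) threshold at, say, 38.0 degrees would flip from false to true in a single step; the membership function above instead assigns intermediate degrees of truth, which is exactly the behaviour fuzzy rule bases build on.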