Data Mining Notes

Intro
Problem categories: Clustering, Classification, Regression, Dimension reduction

Data Visualisation and Interpretation

Descriptive statistics: Mean, Median, Dispersion
Statistical distributions
- The median is a more robust estimator of the central tendency
- The difference Q3 - Q1 is the interquartile range (IQR); it is a more robust dispersion measure
- Normal distribution; Skewed distribution

Visualisation: Boxplot, Scatter plot
Boxplot:
- The red mark shows the mean
- The box goes from the lower quartile to the upper quartile; the box is thus centred on the median
- The whiskers are the minimum and maximum values
- Outlier values are shown as blue crosses
- Outliers are values that lie more than 1.5 * IQR beyond the quartiles

Distances

Hamming distance

Levenshtein distance
The Levenshtein distance is the minimum number of edits needed to transform one string into the other:
1. insertion of a character
2. deletion of a character
3. substitution of a character

Damerau-Levenshtein distance
The Damerau-Levenshtein distance is like the Levenshtein distance, with one more edit operation:
1. insertion of a character
2. deletion of a character
3. substitution of a character
4. transposition of 2 adjacent characters

Jaro distance

K-Means (P166)
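The three edit operations listed under the Levenshtein distance map directly onto the standard dynamic-programming recurrence. A minimal sketch (the function name `levenshtein` is my own, not from the notes):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string: i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion of a character
                cur[j - 1] + 1,            # insertion of a character
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]
```

Each entry takes the cheapest of the three operations, so the final cell is the minimum total number of edits.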
Clustering quality
- Internal
- External

Initialization:
- random
- kmeans++: distant points

Perceptron (P67)

Multi-Layer Perceptron (P76)
- A Multi-Layer Perceptron is made of c neurons, connected to an output neuron
- Each inner neuron acts as an independent hyperplane
- The top neuron combines the independent hyperplanes
- We can now classify data by combining hyperplanes

Flaws

Data normalization
- The MLP won't work if you don't normalize your data
- The output of the MLP is in a limited range (say [-1; 1])
- If the inputs are out of range, the MLP loses information

Over-fitting
- Past a certain number of neurons: very little improvement of the error; the network is mostly learning the noise of the data, i.e. over-fitting
- Training error: error for the training points
- Testing error: error for points not used during training

Maximization Step

Initialization
Simple initialization for K components and N samples:
1. Randomly create K clusters of samples, of the same size
2. The initial mean of component Ci is the sample mean of cluster i
3. The initial standard deviation of component Ci is the sample standard deviation of cluster i
4. The initial weight w of Ci is 1/K

K-means initialization
Multivariate Gaussian mixture

Decision Tree (p40)
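The simple four-step initialization for K Gaussian mixture components can be sketched in pure Python (the function name `init_gmm` is my own; the notes only list the steps):

```python
import random
from statistics import mean, pstdev

def init_gmm(samples, k, seed=0):
    """Simple initialization for K components, following the steps above."""
    # 1. Randomly create K clusters of samples, of (almost) the same size
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    clusters = [shuffled[i::k] for i in range(k)]
    # 2. Initial mean of component Ci is the sample mean of cluster i
    mu = [mean(c) for c in clusters]
    # 3. Initial standard deviation of Ci is the sample std dev of cluster i
    sigma = [pstdev(c) for c in clusters]
    # 4. Initial weight of Ci is 1/K
    w = [1.0 / k] * k
    return mu, sigma, w
```

The EM maximization step would then refine these estimates; this sketch only covers the initialization the notes describe.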