data mining notes.docx

Uploaded by: b****5 | Document ID: 6695768 | Uploaded: 2023-01-09 | Format: DOCX | Pages: 13 | Size: 1.27 MB

Data mining notes

Intro

Problem categories

Clustering

Classification

Regression

Dimension reduction

Data Visualisation and Interpretation

Descriptive statistics

Mean

Median

Dispersion

Statistical distributions

The median is a more robust estimator of the central tendency

The difference Q3 - Q1 is the interquartile range (or IQR); it is a more robust dispersion measure
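
To make the robustness claim concrete, here is a minimal sketch using only Python's standard `statistics` module. The helper name `summarize` and the two small data sets are made up for illustration, and quartiles are computed with the "inclusive" convention, which is one of several in common use.

```python
import statistics

def summarize(values):
    """Return mean, median, standard deviation and IQR of a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "iqr": q3 - q1,
    }

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
dirty = clean[:-1] + [900]  # same data, largest value replaced by an outlier

summary_clean = summarize(clean)
summary_dirty = summarize(dirty)

for name in ("mean", "median", "stdev", "iqr"):
    print(name, summary_clean[name], summary_dirty[name])
```

On this toy data the single outlier moves the mean from 5 to 104 and inflates the standard deviation, while the median (5) and the IQR (4) do not change.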

Normal Distribution

Skewed Distribution

Visualisation

Boxplot

Scatterplot

Boxplot:

The red mark shows the mean

The box goes from the lower quartile to the upper quartile

The box thus contains the median (and is centred on it when the distribution is symmetric)

The whiskers are the minimum and maximum values that are not outliers

Outlier values are shown as blue crosses

Outliers are values which are beyond 1.5 * IQR from the quartiles
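
The 1.5 * IQR rule can be sketched in a few lines. This is an illustrative helper (the name `boxplot_outliers` and the sample data are assumptions, not from the notes); plotting libraries may use a slightly different quartile convention, so the exact fences can differ from a given plot.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return ((lower whisker, upper whisker), outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    # Whiskers stop at the most extreme values that are still inside the fences
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return (min(inside), max(inside)), outliers

whiskers, outliers = boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 900])
print(whiskers, outliers)
```

Here Q1 = 3 and Q3 = 7, so the fences are at -3 and 13: the whiskers span 1 to 8 and 900 is reported as the only outlier.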

Distances

Hamming Distance

Levenshtein distance

The Levenshtein distance is the minimum number of edits needed to transform one string into the other

1 insertion of a character

2 deletion of a character

3 substitution of a character
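
The three edit operations above map directly onto the classic dynamic-programming recurrence. The following is a minimal sketch (the function name `levenshtein` is chosen here for illustration); it keeps only two rows of the table at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion of char_a
                current[j - 1] + 1,      # insertion of char_b
                previous[j - 1] + cost,  # substitution, or free match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```

For "kitten" and "sitting" this gives 3: two substitutions (k to s, e to i) and one insertion (g).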

Damerau-Levenshtein distance

The Damerau-Levenshtein distance is like the Levenshtein distance, with one more edit operation

1 insertion of a character

2 deletion of a character

3 substitution of a character

4 transposition of 2 adjacent characters
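
A compact way to add the transposition step is the restricted variant often called "optimal string alignment" (OSA). Note the assumption: OSA never edits the same substring twice, so it can differ from the unrestricted Damerau-Levenshtein distance on some inputs (for example "ca" versus "abc"). The function name `osa_distance` is illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution, or free match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("ab", "ba"))
```

Swapping two adjacent characters ("ab" to "ba") now costs 1 edit, where plain Levenshtein would count 2.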

Jaro Distance

K-Means (P166)

Clustering quality

Internal:

External:

Initialization:

random

kmeans++:

distant plots
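
As a rough illustration of k-means++ seeding, the sketch below picks the first centre uniformly at random and each further centre with probability proportional to its squared distance to the nearest centre already chosen, which favours centres that are far apart. The names `kmeans_pp_init` and `squared_distance` are illustrative, and the sketch assumes the data contain at least k distinct points.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centres from points using k-means++ style seeding."""
    rng = rng or random.Random()
    centres = [rng.choice(points)]  # first centre: uniform random choice
    while len(centres) < k:
        # Weight of each point = squared distance to its nearest chosen centre,
        # so already-chosen points get weight 0 and distant points are favoured
        weights = [min(squared_distance(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres

points = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
print(kmeans_pp_init(points, 2, rng=random.Random(0)))
```

With two well-separated groups like these, the second centre almost always lands in the group that does not contain the first one, whereas purely random initialization puts both centres in the same group fairly often.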

Perceptron (P67)

Multi-Layer Perceptron (P76)

A Multi-Layer Perceptron is made of c neurons, connected to an output neuron

_Each inner neuron acts as an independent hyperplane

_The top neuron combines the independent hyperplanes

_We can now classify data by combining hyperplanes
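
A small hand-built example may help here. The sketch below uses c = 2 inner neurons with a step activation to compute XOR, which a single hyperplane cannot separate. All weights and biases are picked by hand for illustration (nothing is learned), and the names `step`, `neuron` and `tiny_mlp` are assumptions.

```python
def step(x):
    """Threshold activation: 1 on the positive side of the hyperplane, else 0."""
    return 1 if x > 0 else 0

def neuron(weights, bias, inputs):
    """A single neuron: which side of the hyperplane w.x + b = 0 are we on?"""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def tiny_mlp(x1, x2):
    h1 = neuron([1, 1], -0.5, [x1, x2])     # hyperplane x1 + x2 = 0.5
    h2 = neuron([1, 1], -1.5, [x1, x2])     # hyperplane x1 + x2 = 1.5
    return neuron([1, -1], -0.5, [h1, h2])  # output: above the first, below the second

for a in (0, 1):
    for b in (0, 1):
        print(a, b, tiny_mlp(a, b))
```

The output neuron fires only for points lying between the two hyperplanes, that is for (0, 1) and (1, 0).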

Flaws

Data normalization

MLP won't work if you don't normalize your data

_The output of the MLP is in a limited range (say [-1, 1])

_If the inputs are out of range, the MLP loses information
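
One common way to bring inputs into such a range is min-max scaling. The sketch below rescales a single feature to [-1, 1]; the target range follows the example range in the notes, while the helper names and the sample data are made up for illustration.

```python
def fit_min_max(column):
    """Learn the scaling parameters from the training values of one feature."""
    return min(column), max(column)

def scale(value, lo, hi):
    """Map value linearly so that lo -> -1 and hi -> +1."""
    if hi == lo:
        return 0.0  # constant feature: no spread to rescale
    return 2 * (value - lo) / (hi - lo) - 1

heights_cm = [150, 160, 175, 190]
lo, hi = fit_min_max(heights_cm)
scaled = [scale(h, lo, hi) for h in heights_cm]
print(scaled)
```

The minimum and maximum should be learned on the training data only and then reused unchanged for test data, so that both sets go through the same transformation.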

Over-fitting

Past a certain number of neurons:

_Very little improvement of the error

_Mostly learning noise of the data

_over-fitting

Training error:

error for the training points

Testing error:

error for points not used during training
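
The gap between the two errors can be shown with a deliberately over-flexible model. The sketch below swaps the MLP for a 1-nearest-neighbour classifier, which memorizes every training point including noise; the data set, the "true" rule (label 1 when x >= 5) and the mislabelled point at x = 3 are all invented for illustration.

```python
def nearest_neighbour_predict(train, x):
    """Predict the label of the closest training point (1-NN)."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def error_rate(train, points):
    """Fraction of points whose predicted label differs from the given label."""
    wrong = sum(1 for x, label in points
                if nearest_neighbour_predict(train, x) != label)
    return wrong / len(points)

# (x, label) pairs; the point at x = 3 is deliberately mislabelled noise
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1)]
# Held-out points labelled with the assumed true rule
test = [(1.4, 0), (2.9, 0), (3.2, 0), (4.4, 0), (5.6, 1), (7.6, 1)]

training_error = error_rate(train, train)
testing_error = error_rate(train, test)
print(training_error, testing_error)
```

The training error is 0 because the model reproduces every training label, noise included, while two of the six held-out points near x = 3 are misclassified. A low training error combined with a clearly higher testing error is the usual symptom of over-fitting.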

Maximization Step:

Initialization

Simple initialization for K components and N samples

1 Randomly create K clusters of samples, same size

2 Initial μ for Ci is the sample mean of the cluster i

3 Initial σ for Ci is the sample standard deviation of the cluster i

4 Initial w for Ci is 1/K
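
The four initialization steps above can be sketched for one-dimensional data as follows. The function name `init_gmm_1d` and the dictionary keys are illustrative; the sketch assumes every random cluster receives at least two samples (needed for a standard deviation), and the clusters have exactly the same size only when N is a multiple of K.

```python
import random
import statistics

def init_gmm_1d(samples, k, rng=None):
    """Simple initialization of a K-component 1-D Gaussian mixture."""
    rng = rng or random.Random()
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Step 1: K random clusters of (nearly) equal size
    clusters = [shuffled[i::k] for i in range(k)]
    return [
        {
            "mean": statistics.mean(cluster),  # step 2: sample mean
            "std": statistics.stdev(cluster),  # step 3: sample standard deviation
            "weight": 1 / k,                   # step 4: equal mixing weights
        }
        for cluster in clusters
    ]

samples = [1.0, 1.5, 2.0, 2.5, 8.0, 8.5, 9.0, 9.5, 15.0, 15.5, 16.0, 16.5]
for component in init_gmm_1d(samples, 3, rng=random.Random(0)):
    print(component)
```

Because the clusters are random, the initial components tend to look alike (similar means, wide standard deviations); the subsequent EM iterations are what pull them apart.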

K-means initialization

multivariate Gaussian mix

Decision Tree (p40)
