piaip's Using (lib)SVM Tutorial

piaip at csie dot ntu dot edu dot tw, Hung-Te Lin

Fri Apr 18 15:04:53 CST 2003

$Id: svm_tutorial.html,v 1.13 2007/10/02 05:51:55 piaip Exp piaip $

Original author: Hung-Te Lin (林弘德). Please keep the original source attribution when reposting.

Why this tutorial is here

I've long considered SVM an interesting and useful tool, but I couldn't attend the "Data Mining and SVM" course by Prof. cjlin (mostly due to scheduling conflicts). After reading some materials on the Internet and discussing libsvm with some of my classmates and friends, I wanted to write up some notes here as a tutorial for those who want to use libsvm without knowing the complete theory behind SVM.

The original README and FAQ files that come with libsvm are good documents too, but you may need some basic knowledge of SVM and its workflow to follow them (that's how I felt when I was reading them). This tutorial is written specifically for those starting from zero.

Some people later provided feedback as well, so I must thank those who helped me make this tutorial:

kcwu, biboshen, puffer, somi

Remember that some of the statements below may not be strictly correct. But for those who just wish to "USE" SVM, I think the explanation below is easier to understand.

This tutorial is basically for people who already know how to program. It's also a memo to myself. Neither much mathematics nor prior SVM knowledge is required.

If you still can't understand this tutorial, there are three possibilities:

1. I didn't explain clearly enough, 2. You lack sufficient common knowledge, 3. You don't use your brain properly ^^;

Since I began writing this with no understanding of the subject myself, and this document has been read by many people who also didn't understand SVM but gained a certain level of understanding after reading it, possibility 1 can be ruled out. Thus if you can't understand it, you must belong to one of the latter two categories :P So even if you have questions after reading this, don't ask me.

SVM: What is it and what can it do for me?

SVM, Support Vector Machine, is something that has similar roots to neural networks, but nowadays it is most widely used for classification. That means: if I have some sets of things that are already classified (but you know nothing about HOW I CLASSIFIED THEM, i.e. you don't know the rules used for classification), then when new data comes, SVM can PREDICT which set it should belong to.

It sounds marvelous (if you don't think so, think again about what this means: the rules used for classification are unknown! If it still doesn't seem marvelous, try writing a program that solves this problem yourself), and it would seem to require advanced techniques like AI searching or some time-consuming complex computation. But SVM uses Statistical Learning Theory to solve this problem in reasonable time.

Now let's explain with a graphical example (by SVMToy). Suppose I mark lots of points with different colors on a plane; the color of each point is its "class" and the location is its data. SVM can then find equations that separate these points, and with these equations we get colored regions. When a new point (data) comes, we can find (predict) what color (class) it should be just by checking which region the point's location (data) falls in.

[Figure: Original Data]

[Figure: SVM Regions]

Of course SVM is not really just about painting and marking regions, but with the example above you should be able to get some idea of what SVM is doing.

To get yourself more familiar with SVM, you may refer to the slides cjlin used in his Data Mining course: pdf or ps.

I'm going to try to explain and use libsvm without those slides.

Thus we can consider SVM as a black box: just push data into SVM and use the output.

How do I get SVM?

Chih-Jen Lin's (cjlin) libsvm is of course the best tool you can ever find.

Download libsvm

Download Location: libsvm.zip or libsvm.tar.gz

The contents of the .zip and the .tar.gz are the same; which one you take depends on your OS. People using Windows usually like .zip files because they have WinZIP (which I always replace with WinRAR). UNIX users mostly prefer .tar.gz.

Build libsvm

After you extract the archive, just type make if you are using UNIX. If it fails to build, read the documentation carefully and apply common sense. Since this is a tutorial I won't spend time on the details; build failures are very rare, and usually mean that your system has a problem or that you made a silly mistake. You may ignore the other subdirectories. We only need these executable files: svm-train, svm-scale, and svm-predict.

Windows users may rebuild from source if they want, but there are already some prebuilt binaries in the archive: just check the "windows" subdirectory and you should find svmtrain.exe, svmscale.exe, svmpredict.exe, and svmtoy.exe.

Using SVM

libsvm has lots of functions. This tutorial will only explain the easier parts (mostly classification with the default model).

The programs

I'm going to describe what the most important executables do. The file names are a little bit different under UNIX and Windows; apply common sense to see which one I'm referring to.

svmtrain

Use your data for training. Running SVM is often jokingly referred to as 'driving trains' by its non-native English speaking authors because of this program's name. svmtrain accepts input in a specific format, which will be explained below, and then generates a 'Model' file. You may think of a 'Model' as a storage format for the internal data of SVM: predict needs a model in order to predict and cannot consume the raw data directly. This should appear very reasonable after some thought: training with data is a time-consuming process, so we 'train' first and store the result, and the 'predict' operation can then simply load that stored data and go much faster.

svmpredict

Output the predicted class of the new input data according to a pre-trained model.

svmscale

Rescale data. The original data may be too large or too small in range, so we can rescale it to a proper range so that training and predicting will be faster.
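The idea of rescaling can be sketched in a few lines of Python. This only illustrates linear (min-max) scaling of one attribute into a target range; it is not svmscale's actual implementation, and the function name rescale_column and the sample numbers are assumptions made up for the example.

```python
def rescale_column(values, lower=-1.0, upper=1.0):
    """Linearly map a list of numbers onto [lower, upper] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map every value to the midpoint.
        return [(lower + upper) / 2.0 for _ in values]
    span = hi - lo
    return [lower + (upper - lower) * (v - lo) / span for v in values]


# Hypothetical raw attribute with a large range.
raw = [0.0, 250.0, 1000.0]
scaled = rescale_column(raw)
print(scaled)
```

Whatever range is chosen, new data should be rescaled with the same minimum and maximum that were found on the training data, otherwise the two sets end up on different scales.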

File Format

This is the input file format of SVM. You may also refer to the file "heart_scale", which is bundled in the official libsvm source archive.

[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...

One record per line, as:

+1 1:0.708 2:1 3:1 4:-0.320 5:-0.105 6:-1

label

Sometimes referred to as 'class': the class (or set) of your classification. Usually we put integers here.

index

Ordered indexes, usually consecutive integers.

value

The data for training. Usually lots of real (floating point) numbers.

Each line has the structure described above. It means: I have an array (vector) of data (numbers) value1, value2, ... valueN (the order of the values is specified by the respective index), and the class (or the result) of this array is label.
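To make the structure concrete, here is a minimal Python sketch that splits one record into its label and its index/value pairs. The function name parse_record is an assumption for illustration and is not part of libsvm; the sample line follows the format described above.

```python
def parse_record(line):
    """Split one '[label] [index]:[value] ...' line into (label, {index: value})."""
    fields = line.split()
    label = float(fields[0])          # labels are usually integers such as +1 / -1
    attributes = {}
    for field in fields[1:]:
        index, value = field.split(":")
        attributes[int(index)] = float(value)
    return label, attributes


label, attributes = parse_record("+1 1:0.708 2:1 3:1 4:-0.320 5:-0.105 6:-1")
print(label, attributes)
```

A dictionary keyed by index is used here because it makes no assumption that every index is present on every line.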

Maybe this is confusing to you: why value1, value2, ...? It has to do with how SVM works. You can think of it this way (I'm not saying this is strictly correct): the name is Support "Vector" Machine, so the input training data is a "vector", i.e. a row of x1, x2, x3, ... These values are the valueN, and the n in x[n] is specified by indexN. These things are also called "attributes".

In practice, the input data to the problem you are trying to solve usually involves lots of 'features', or say 'attributes', so the input will be a set (or say vector/array). Take the marking-points-and-finding-regions example described above: each point has coordinates X and Y, so it has two attributes (X and Y). To describe two points (0,3) and (5,8) as having labels (classes) 1 and 2, we write them as:

1 1:0 2:3
2 1:5 2:8

And 3-dimensional points will have 3 attributes.
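Writing such records from a program is straightforward. The following Python sketch turns labeled points into lines of this format; format_record is a hypothetical helper name, and the two points are the ones from the example.

```python
def format_record(label, values):
    """Write one record as '[label] 1:[value1] 2:[value2] ...' (indexes start at 1)."""
    pairs = ["%d:%g" % (i, v) for i, v in enumerate(values, start=1)]
    return " ".join([str(label)] + pairs)


points = [(1, (0, 3)), (2, (5, 8))]   # (label, (X, Y))
lines = [format_record(label, xy) for label, xy in points]
print("\n".join(lines))
```

The same function handles 3-dimensional points unchanged, since it simply numbers however many values it is given.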

The biggest advantage of this file format is that we can specify a sparse matrix, i.e. some attributes of a record can be omitted.
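As a sketch of what omitting attributes looks like in practice, the Python snippet below writes a record while skipping zero-valued attributes, then expands it back by filling the missing indexes with 0 (libsvm reads an omitted attribute as 0). The helper names to_sparse_line and to_dense and the sample values are made up for this illustration.

```python
def to_sparse_line(label, values):
    """Format a record, leaving out every attribute whose value is 0."""
    pairs = ["%d:%g" % (i, v) for i, v in enumerate(values, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def to_dense(line, size):
    """Expand a sparse record back into (label, list of `size` values)."""
    fields = line.split()
    dense = [0.0] * size              # omitted attributes default to 0
    for field in fields[1:]:
        index, value = field.split(":")
        dense[int(index) - 1] = float(value)
    return fields[0], dense


line = to_sparse_line(1, [0, 3, 0, 0, 2.5])
print(line)
print(to_dense(line, 5))
```

For data where most attributes are zero, this keeps the input file much smaller than writing every index on every line.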

To Run libsvm

Now I'll show you how to use the libsvm programs. You may use the heart_scale file in the libsvm source archive as input, as I'll do in this example:

By now you should also understand that the workflow of using SVM is roughly:
