How to choose algorithms for Microsoft Azure Machine Learning

Microsoft Docs, 18/12/2017

The answer to the question "What machine learning algorithm should I use?" is always "It depends."

It depends on the size, quality, and nature of the data. It depends on what you want to do with the answer. It depends on how the math of the algorithm was translated into instructions for the computer you are using. And it depends on how much time you have. Even the most experienced data scientists can't tell which algorithm will perform best before trying them.

The Machine Learning Algorithm Cheat Sheet

The Microsoft Azure Machine Learning Algorithm Cheat Sheet helps you choose the right machine learning algorithm for your predictive analytics solutions from the Microsoft Azure Machine Learning library of algorithms. This article walks you through how to use it.

Note: To download the cheat sheet and follow along with this article, go to "Machine learning algorithm cheat sheet for Microsoft Azure Machine Learning Studio".

This cheat sheet has a very specific audience in mind: a beginning data scientist with undergraduate-level machine learning experience, trying to choose an algorithm to start with in Azure Machine Learning Studio. That means that it makes some generalizations and oversimplifications, but it points you in a safe direction. It also means that there are lots of algorithms not listed here. As Azure Machine Learning grows to encompass a more complete set of available methods, we'll add them.

These recommendations are compiled feedback and tips from many data scientists and machine learning experts. We didn't agree on everything, but I've tried to harmonize our opinions into a rough consensus. Most of the statements of disagreement begin with "It depends".

How to use the cheat sheet

Read the path and algorithm labels on the chart as "For <path label>, use <algorithm>." For example, "For speed, use two class logistic regression."

Sometimes more than one branch applies. Sometimes none of them is a perfect fit. The branches are intended as rule-of-thumb recommendations, so don't worry about an exact match. Several data scientists I talked with said that the only sure way to find the very best algorithm is to try all of them.

Here's an example from the Azure AI Gallery of an experiment that tries several algorithms against the same data and compares the results: "Compare Multi-class Classifiers: Letter recognition".

Tip: To download and print a diagram that gives an overview of the capabilities of Machine Learning Studio, see "Overview diagram of Azure Machine Learning Studio capabilities".

Flavors of machine learning

Supervised

Supervised learning algorithms make predictions based on a set of examples. For instance, historical stock prices can be used to hazard guesses at future prices. Each example used for training is labeled with the value of interest, in this case the stock price. A supervised learning algorithm looks for patterns in those value labels. It can use any information that might be relevant (the day of the week, the season, the company's financial data, the type of industry, the presence of disruptive geopolitical events), and each algorithm looks for different types of patterns. After the algorithm has found the best pattern it can, it uses that pattern to make predictions for unlabeled testing data: tomorrow's prices.

Supervised learning is a popular and useful type of machine learning. With one exception, all the modules in Azure Machine Learning are supervised learning algorithms.
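The train-then-predict pattern of supervised learning can be illustrated with a very small sketch. It is not an Azure Machine Learning module: the `predict_knn` helper, the nearest-neighbor technique, and the made-up training examples are all assumptions chosen only to show labeled examples being used to predict a value for an unlabeled point.

```python
import math


def predict_knn(train, query, k=3):
    """Predict a value as the mean label of the k nearest labeled examples."""
    by_distance = sorted(train, key=lambda example: math.dist(example[0], query))
    nearest = by_distance[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Made-up labeled examples: (features, value of interest).
train = [
    ((1.0, 1.0), 10.0),
    ((1.0, 2.0), 12.0),
    ((2.0, 1.0), 11.0),
    ((8.0, 8.0), 50.0),
    ((9.0, 8.0), 52.0),
    ((8.0, 9.0), 54.0),
]

# Unlabeled points: the model only sees the features.
low = predict_knn(train, (1.5, 1.5))
high = predict_knn(train, (8.5, 8.5))
print(low, high)
```

Other algorithms would look for a different kind of pattern in the same labeled examples, which is why their predictions can differ.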

There are several specific types of supervised learning that are represented within Azure Machine Learning: classification, regression, and anomaly detection.

Classification

When the data are being used to predict a category, supervised learning is also called classification. This is the case when assigning an image as a picture of either a cat or a dog. When there are only two choices, it's called two-class or binomial classification. When there are more categories, as when predicting the winner of the NCAA March Madness tournament, this problem is known as multi-class classification.

Regression

When a value is being predicted, as with stock prices, supervised learning is called regression.

Anomaly detection

Sometimes the goal is to identify data points that are simply unusual. In fraud detection, for example, any highly unusual credit card spending patterns are suspect. The possible variations are so numerous and the training examples so few that it's not feasible to learn what fraudulent activity looks like. The approach that anomaly detection takes is to simply learn what normal activity looks like (using a history of non-fraudulent transactions) and identify anything that is significantly different.

Unsupervised

In unsupervised learning, data points have no labels associated with them. Instead, the goal of an unsupervised learning algorithm is to organize the data in some way or to describe its structure.
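One common way to organize unlabeled data is to group it into clusters. The sketch below assumes a simple one-dimensional k-means loop; the `kmeans_1d` function, the starting centers, and the readings are hypothetical and serve only to show structure being found without any labels.

```python
def kmeans_1d(points, centers, iterations=10):
    """Group 1-D points around the given starting centers (Lloyd's algorithm)."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            # Assign each point to the index of its closest center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up, unlabeled readings.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, groups = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, groups)
```

No label is ever supplied; the two groups emerge only from the distances between the points.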

Organizing the data can mean grouping it into clusters or finding different ways of looking at complex data so that it appears simpler or more organized.

Reinforcement learning

In reinforcement learning, the algorithm gets to choose an action in response to each data point. The learning algorithm also receives a reward signal a short time later, indicating how good the decision was. Based on this, the algorithm modifies its strategy in order to achieve the highest reward.
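The choose-an-action, receive-a-reward, adjust-the-strategy loop can be sketched as follows. This is a minimal illustration using an epsilon-greedy rule with running-average value estimates, which is one simple technique among many. The sensor readings, the action names, and the `reward` function are all hypothetical.

```python
import random

ACTIONS = ["forward", "turn"]
READINGS = ["clear", "blocked"]


def reward(reading, action):
    """Hypothetical reward signal: go forward when clear, turn when blocked."""
    good = "forward" if reading == "clear" else "turn"
    return 1.0 if action == good else 0.0


def learn(steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {r: {a: 0.0 for a in ACTIONS} for r in READINGS}
    count = {r: {a: 0 for a in ACTIONS} for r in READINGS}
    for _ in range(steps):
        reading = rng.choice(READINGS)  # one data point arrives
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore occasionally
        else:
            action = max(ACTIONS, key=lambda a: value[reading][a])  # exploit
        r = reward(reading, action)
        count[reading][action] += 1
        # Running average of the rewards seen for this (reading, action) pair.
        value[reading][action] += (r - value[reading][action]) / count[reading][action]
    # The learned strategy: the best-valued action for each reading.
    return {r: max(ACTIONS, key=lambda a: value[r][a]) for r in READINGS}


policy = learn()
print(policy)
```

The occasional random action is what lets the algorithm discover that a different choice earns a higher reward than the one it currently prefers.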

Currently there are no reinforcement learning algorithm modules in Azure Machine Learning. Reinforcement learning is common in robotics, where the set of sensor readings at one point in time is a data point, and the algorithm must choose the robot's next action. It is also a natural fit for Internet of Things applications.

Considerations when choosing an algorithm

Accuracy

Getting the most accurate answer possible isn't always necessary. Sometimes an approximation is adequate, depending on what you want to use it for. If that's the case, you may be able to cut your processing time dramatically by sticking with more approximate methods. Another advantage of more approximate methods is that they naturally tend to avoid overfitting.

Training time

The number of minutes or hours necessary to train a model varies a great deal between algorithms. Training time is often closely tied to accuracy; one typically accompanies the other. In addition, some algorithms are more sensitive to the number of data points than others. When time is limited, it can drive the choice of algorithm, especially when the data set is large.

Linearity

Lots of machine learning algorithms make use of linearity.

Linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog). These include logistic regression and support vector machines (as implemented in Azure Machine Learning).
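The straight-line assumption can be seen in a small sketch. It uses a perceptron, a basic linear classifier chosen here only for brevity (it is not one of the algorithms named above), on two made-up data sets: one that a line can separate and one (XOR) that no line can.

```python
def train_perceptron(data, epochs=50):
    """Learn weights w and bias b so that w.x + b > 0 predicts class 1."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred  # 0 when correct, +1 or -1 when wrong
            w[0] += err * x1
            w[1] += err * x2
            b += err
    return w, b


def accuracy(data, w, b):
    """Fraction of examples on the correct side of the learned line."""
    hits = sum(
        (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == y for (x1, x2), y in data
    )
    return hits / len(data)


# Separable by a line: class 1 only when both inputs are 1 (logical AND).
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
# Not separable by any line: class 1 when exactly one input is 1 (XOR).
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

acc_and = accuracy(and_data, *train_perceptron(and_data))
acc_xor = accuracy(xor_data, *train_perceptron(xor_data))
print(acc_and, acc_xor)
```

On the XOR data no choice of weights can classify all four points correctly, so the accuracy stays at or below 0.75 however long the training runs.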

Linear regression algorithms assume that data trends follow a straight line. These assumptions aren't bad for some problems, but on others they bring accuracy down.

Figure: Non-linear class boundary. Relying on a linear classification algorithm would result in low accuracy.

Figure: Data with a nonlinear trend. Using a linear regression method would generate much larger errors than necessary.

Despite their dangers, linear algorithms are very popular as a first line of attack.
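The cost of assuming a straight-line trend can be sketched with ordinary least squares on two made-up data sets, one linear and one curved. The helper names `fit_line` and `mse` are illustrative only and are not part of any Azure Machine Learning API.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5]
linear_ys = [2 * x + 1 for x in xs]  # trend really is a straight line
curved_ys = [x * x for x in xs]      # trend is quadratic

mse_linear = mse(xs, linear_ys, *fit_line(xs, linear_ys))
mse_curved = mse(xs, curved_ys, *fit_line(xs, curved_ys))
print(mse_linear, mse_curved)
```

The line fits the first data set essentially perfectly, while on the curved data the best possible line still leaves a substantial error that a non-linear model could avoid.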

Linear algorithms tend to be algorithmically simple and fast to train.

Number of parameters

Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm's behavior, such as error tolerance or number of iterat

