Literature Translation: Data Types Generalization for Data Mining Algorithms (English Translation)
Department / Major / Class / Student Name / Student ID / Advisor

Data Types Generalization for Data Mining Algorithms

Abstract

With the growth of database applications, mining interesting information from huge databases has become a major concern, and a variety of mining algorithms have been proposed in recent years. As
we know, the data processed in data mining may come from many sources that use different data types. However, no single algorithm can be applied to all applications, because each algorithm fits only certain data types, so the selection of an appropriate mining algorithm is based not only on the goal of the application but also on data fittability. Therefore, transforming a non-fitting data type into the target one is also an important task in data mining, but the work is often tedious or complex because many data types exist in the real world. Merging the similar data types of
a given selected mining algorithm into a generalized data type seems to be a good approach to reducing the transformation complexity. In this work, the data types fittability problem for six kinds of widely used data mining techniques is discussed, and a data type generalization process consisting of merging and transforming phases is proposed. In the merging phase, the original data types of the data sources to be mined are first merged into generalized ones. The transforming phase then converts the generalized data types into the target ones for the selected mining algorithm. Using the data type generalization process, the user can select an appropriate mining algorithm based only on the goal of the application, without considering the data types.

1. Introduction

In recent years, the amount of data of all kinds has grown rapidly. Widely available, low-cost computer technology now makes it possible to both
collect historical data and perform on-line analysis of newly arriving data. Automated data generation and gathering lead to tremendous amounts of data stored in databases. Although we are flooded with data, we lack knowledge. Data mining is the automated discovery of non-trivial, previously unknown, and potentially useful knowledge embedded in databases. Different kinds of data mining methods and algorithms have been proposed, each of which has its own advantages and suitable application domains. However, it is difficult for users to choose an appropriate one by themselves. This is because the data provided often cannot be used directly by data mining algorithms. Since most data mining algorithms can only be applied to some specific data types, the types of data stored in databases restrict the choice of data mining methods. If certain kinds of knowledge need to be obtained using some data mining algorithms, data types transformation should be done first; this is what we call “the data types fittability problem” for data mining. For the time being, there is no tool that can help users do this kind of data types transformation. In this paper, we survey and analyze the data types fittability problem for data mining algorithms, and then propose a “data types generalization process” to solve the problem for the attributes in relational databases.

The “data types generalization process”, which includes merging and transforming phases, is a procedure to transform the data types of attributes contained in relations (tables). In the merging phase, the original data types of the data sources to be mined are first merged into generalized ones. The transforming phase then converts the generalized data types into the target ones for the selected mining algorithm. Using the data type generalization process, the user can select an appropriate mining algorithm based only on the goal of the application, without considering the data types.

2. Related work

As mentioned above, because many data mining algorithms can only be applied to a restricted range of data types, users may need to perform data types transformation before the selected algorithm is executed. In this paper, we propose a general concept called the “data types generalization process”, which provides a procedure for doing this kind of data types transformation. Data types generalization can be seen as a pre-processing step of data mining. Of course, other pre-processing steps such as data selection, data cleaning, dimension (attribute) reduction, and missing data handling may also need to be performed before running the selected data mining algorithm. In summary, the whole process of data mining is the so-called KDD (knowledge discovery in databases) process, as shown in Figure 1.

Figure 1: The KDD process and the role of data types generalization.

There is a major difference between the data types generalization process and other data mining pre-processes. Other pre-processes (like missing value handling) are all independent of the selected data mining method. That is, they can be done without knowing which data mining algorithm will be used. The data types generalization process, however, clearly depends on the desired mining method. The goal of data transformation using data types generalization is to make the specified data set suitable for the mining algorithm. To achieve this goal, we must survey both the data types in databases and their relations with various data mining methods. The flow of solving a data mining problem with data transformation is illustrated in Figure 2.

Figure 2: Solving data mining problems with data transformation.

Some researchers have proposed generalizing the data contained in attributes using attribute-oriented induction, which allows
the generalization of data and offers two major advantages for the mining of large databases. First, it allows the raw data to be handled at higher conceptual levels. Generalization is performed with the use of attribute concept hierarchies, where the leaves of a given attribute's concept hierarchy correspond to the attribute's values in the data (referred to as primitive level data). Generalization of the training data is achieved by replacing primitive level data with higher level concepts.

In fact, data generalization using attribute concept hierarchies is a kind of data type transformation that reduces the number of distinct values contained in attributes. We first provide a typical description of the data types fittability problem and a data types generalization process to define and solve the data types transformation problem for attributes. Hence, data generalization using concept hierarchies is included in the process for performing specified data types transformation.

Another related line of work surveys how to transform data into numerical values. Almost all data-driven algorithms utilize numeric inputs. From a computer processing point of view, handling computations with numbers is easier and more efficient. Therefore, if the input values are non-numeric (e.g., text strings), they should in many cases be intelligently converted to meaningful numerical values. Numerical values can be seen as a data type, and transforming data into numerical values
is a kind of data types transformation. These strategies are included in the data types generalization process for performing data types transformation.

3. Analysis of the data types fittability problem

In recent years, due to the explosion of information and the rapid growth of database applications,
data mining techniques have become more and more important. For this reason, different kinds of data mining methods and algorithms have been proposed. However, it is difficult for users without prior knowledge of data mining to choose a suitable one by themselves. In fact, which kind of data mining method should be applied depends on both the characteristics of the data to be mined and the kind of knowledge to be found through the data mining process. Hence, the types of data stored in databases play an important role during the data mining process and restrict the data mining methods that users can choose. Each kind of data mining method can only be applied to databases suited to it; this is what we call the data types fittability problem for data mining. To solve this problem, we need to investigate the relationships between the characteristics
of the data to be mined and various kinds of data mining techniques. With these relationships, we can clearly analyze the data types fittability problem and further determine whether the data types transformation can be performed. Hence, analyzing these relationships is preparatory work for our data types generalization process and explains why the process can solve the data fittability problem. We now present the analysis.

3.1 Four kinds of data forms for data mining

Data mining techniques can usually be applied to four kinds of data forms: textual, temporal, transactional and relational. Different kinds of data forms are used to store different kinds of data types. We describe each kind of data form below:

(1) Textual data forms: Textual data forms are used to represent texts or documents. Basically, this kind of data form can be seen as a very large collection of characters.
(2) Temporal data forms: Time-series data is stored in temporal data forms. Data that varies with time (such as historical data) can be stored in the form of numerical time series.
(3) Transactional data forms: For example, the past transactions of a market can be stored in transactional data forms. Each transaction records a list of items bought in that transaction.
(4) Relational data forms: Relational data forms are the most widely used data forms and can store different kinds of data. The basic units of relational data forms are relations(
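
As a rough illustration only, the four data forms listed above can be pictured as simple in-memory structures. The sketch below is not taken from the paper: every name (`Customer`, `attribute_types`) and every sample value is a hypothetical assumption chosen to show how each form might look, and how the data types of the attributes in one row of a relation could be inspected.

```python
from dataclasses import dataclass, asdict

# All names and sample values here are hypothetical illustrations.

# (1) Textual form: a document viewed as a long sequence of characters.
textual = "Data mining is the automated discovery of useful knowledge."

# (2) Temporal form: a numerical time series of (time step, value) pairs.
temporal = [(1, 20.5), (2, 21.0), (3, 19.8)]

# (3) Transactional form: each transaction lists the items bought together.
transactional = [
    {"bread", "milk"},
    {"bread", "butter", "jam"},
]


# (4) Relational form: a relation (table) is a collection of rows
#     that share the same attributes.
@dataclass(frozen=True)
class Customer:
    customer_id: int
    city: str
    age: int


relational = [Customer(1, "Taipei", 34), Customer(2, "Tainan", 52)]


def attribute_types(row) -> dict:
    """Return the Python type name of each attribute value in one row."""
    return {name: type(value).__name__ for name, value in asdict(row).items()}


if __name__ == "__main__":
    print(attribute_types(relational[0]))
```

Inspecting attribute types in this way is only a stand-in for the kind of survey the text calls for; the paper's own data type categories are not reproduced here.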