Data Mining Techniques on Course Selection System Application

Abstract: This technical report describes a course selection system application based on data mining techniques. The application provides functions including association analysis, classification, clustering, and outlier detection. These techniques are applied according to the users' requirements and specifications and the aspects they are interested in. The main users of the application are the people who arrange course information, such as the course information registrar (course selection administrator). In general, the goal of the application is to give the registrar detailed information about which courses should be added, reduced, or canceled, together with course popularity analysis and operations on courses. This report first introduces some background knowledge and then explains in detail how that knowledge is applied in the application, that is, the design and implementation approaches. A conclusion is drawn at the end of the report.

Keywords: data mining, course selection, application

1 Introduction

Data mining is an indispensable step in the process of knowledge discovery in databases. In general, data mining is a process that searches for hidden information in massive data by using specific algorithms. It is closely related to computer science and achieves this goal by using statistics, online analytical processing, information retrieval, machine learning, expert systems (which depend on past rules of thumb), and pattern recognition.

Data mining techniques can be applied to a wide range of real-life problems. In this report, the example is course selection, which is a routine part of college life. Different students have different course preferences, so some professors end up with more students than expected and others with fewer. If the number of students is much larger than expected, the professor must apply for a bigger classroom or open another session at a different time. If the number of students is smaller than expected, the professor must apply for a smaller classroom or even cancel the course. In this way, professors and students can make the most of the teaching resources. All of these decisions are made by the course information registrar, who can use the course selection system application to reach this goal. In short, the application helps the course information registrar arrange courses more efficiently.

2 Existing Database

The existing database is the data that the course information registrar has access to, that is, the raw data that the registrar analyzes with this application.

Student Information: student personal information, including name, gender, date of birth, student ID, class number, major, grade, and selected courses.
Professor Information: professor personal information, including name, gender, date of birth, professor ID, courses to teach, and grade.
Classroom Information: classroom number, size, and available time.
Course Information: course name, course number, course hours, and credits.
History Record: data about the courses taught in the past, including the student list, professor list, and class list of every course. This is a massive amount of data stored in the database for analysis.

The relationships above can be described by the following graph:

[Figure: relationships among the databases]

3 Data Mining Techniques

3.1 Association Analysis

3.1.1 Theory

The main aspect of association analysis is frequent pattern analysis. A frequent pattern is a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set. The concept was first proposed in the context of frequent itemsets and association rule mining, which aims at finding inherent regularities in data. Applications related to frequent pattern analysis include basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis, and DNA sequence analysis.

One scalable mining method is the Apriori algorithm. The algorithm can be described by the following flow chart:

[Figure: flow chart of the Apriori algorithm]

Using the Apriori algorithm, frequent n-itemsets can be found for each value of n, where n is the number of items in an itemset. For example, by running the Apriori algorithm on the items bought in a supermarket, a clerk may find that customers tend to buy beer and diapers together, which helps the supermarket manager adopt sales strategies that increase sales volume. Similarly, the Apriori algorithm applies to the scenario of the course selection system application.

3.1.2 Implementation

In the course selection system application, the course information registrar first logs in to the system and then chooses a data mining technique and a raw database. In this case, the technique is association analysis and the data are the student and course databases. The registrar runs the Apriori algorithm on the student and course databases and sets the itemset size as desired, for example two items. The application then shows the frequent patterns ranked from the most frequent to the least frequent, which helps the registrar analyze the data and make decisions. The most frequent pattern identifies the two courses that are most closely related, which indicates that a large share of students select these two courses together. This suggests that the registrar should place these two courses next to each other in the course selection system application so that students can select them conveniently. For example, suppose the registrar finds that 60% of students select Advanced Mathematics and Mathematical Modelling together. Based on this result, the registrar should place these two courses on the same web page, which makes it convenient for students to select related courses.

3.2 Classification

3.2.1 Theory

Classification is a form of supervised learning, in which the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation, and new data are classified based on the training set. Classification predicts categorical class labels (discrete or nominal). It constructs a model based on the training set and the values (class labels) of a classifying attribute, and then uses the model to classify new data.

One way to classify data is the decision tree. Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flow-chart-like tree structure in which each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The top-most node in a tree is the root node.

The basic decision tree algorithm is a greedy algorithm. The tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root. Attributes are categorical (continuous-valued attributes are discretized in advance). Examples are partitioned recursively based on selected attributes. Test attributes are selected on the basis of a heuristic or statistical measure (e.g. information gain). Partitioning stops when one of the following conditions holds:

All samples for a given node belong to the same class.
There are no remaining attributes for further partitioning; in this case majority voting is employed to classify the leaf.
There are no samples left.

The attribute to split on is selected by information gain, which is calculated in the following steps. Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$, and let $m$ be the number of classes.

Expected information (entropy) needed to classify a tuple in $D$:

$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

Information needed (after using $A$ to split $D$ into $v$ partitions $D_1, \dots, D_v$) to classify $D$:

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)$

Information gained by branching on attribute $A$:

$Gain(A) = Info(D) - Info_A(D)$

After the information gain of each attribute has been calculated, the attribute with the highest information gain is selected. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found. All in all, the decision tree is well suited to course selection data.

3.2.2 Implementation

In the course selection system application, different courses attract different students. By analyzing the student attributes and constructing a decision tree for each course, patterns can be found that are helpful for analysis. The training data set is the student database. First, the information gain of each attribute is calculated and the attribute with the highest information gain is selected. Then the decision tree algorithm is run to classify the data. In more detail, suppose the attribute major has the highest information gain; major then becomes the root of the decision tree. The information gain is recalculated for the remaining attributes, the attribute with the highest information gain is selected again, and so on. The process can be described by the following flow chart:

[Figure: flow chart of the decision tree construction process]

In this application, running the decision tree algorithm reveals such patterns. For example, a decision tree can be drawn for each course, such as Java. The decision tree is as follows:

[Figure: decision tree for the Java course]

In this diagram, the root is major, which has the highest information gain: students whose major is computer science will select the Java course. For students whose major is language, the outcome depends on gender. If gender is male, the student will select the c
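
The attribute selection step of the decision tree algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the application's actual implementation: the records in `students`, the attribute names `major` and `gender`, and the label `selects_java` are hypothetical toy data, and the helper names `entropy`, `information_gain`, and `best_attribute` are introduced here only for illustration. It computes Info(D), Info_A(D), and Gain(A) for categorical attributes and picks the attribute with the highest gain as the split.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    total = len(rows)
    # Partition the class labels by the value of the attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    # Info_A(D): entropy of each partition, weighted by partition size.
    info_a = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy([row[target] for row in rows]) - info_a


def best_attribute(rows, attributes, target):
    """Return the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy records; a real run would read the student database.
students = [
    {"major": "computer science", "gender": "male", "selects_java": "yes"},
    {"major": "computer science", "gender": "female", "selects_java": "yes"},
    {"major": "computer science", "gender": "male", "selects_java": "yes"},
    {"major": "computer science", "gender": "female", "selects_java": "yes"},
    {"major": "language", "gender": "male", "selects_java": "yes"},
    {"major": "language", "gender": "female", "selects_java": "no"},
    {"major": "language", "gender": "male", "selects_java": "no"},
    {"major": "language", "gender": "female", "selects_java": "no"},
]

root = best_attribute(students, ["major", "gender"], "selects_java")
print("attribute chosen for the root:", root)
```

On this toy data the gain of `major` is larger than the gain of `gender`, so `major` would be placed at the root. A full decision tree would repeat the same selection on each partition until one of the stopping conditions is met.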