Implementation of the Two-Phase Multiway Merge-Sort Algorithm for Database Systems in C

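Two-Phase Multiway Merge-Sort (TPMMS) Lab Report

Contents
1 Purpose
2 Task description
3 Environment
4 Design and implementation
4.1 Algorithm description
4.2 Design approach
4.3 Data structures
4.4 Implementation
5 Experimental results
5.1 TPMMS results with 50 MB of memory
5.2 TPMMS results with 10 MB of memory
5.3 TPMMS results with 100 MB of memory
5.4 Comparison of the three
6 Problems encountered and solutions
6.1 Problems in Phase 2 and their solutions
6.2 Generating sub-record file names
7 Code appendix

1 Purpose
By implementing the merge-sort algorithm, understand how the I/O model underlying external-memory algorithms differs from the RAM model underlying in-memory algorithms, and understand how different disk-access optimizations improve data-access performance.

2 Task description
Generate a text file of 10,000,000 records, each 100 bytes long. The experiment considers only one attribute A of each record, and A is assumed to be an integer. Records are packed into blocks in non-spanned fashion: leftover space in a block that is smaller than one record is not used. The block size can be looked up on your own operating system; on Windows XP it is typically 4096 bytes. Allocate 50 MB of main memory for the external merge-sort. Design and implement a program that does the following:
1) Generate the text file, with randomly generated values for attribute A.
2) Sort the records in the text file by attribute A using the method from the lecture slides, where in the second phase each sublist uses a buffer of one block.
3) Modify the second-phase algorithm to follow the textbook's cylinder-based buffers (1 MB) method.
4) Compare the time performance of the two methods. If more memory were available, how much further could the performance improve?

3 Environment
1) Visual C++ 6.0
2) Windows 7

4 Design and implementation
4.1 Algorithm description
The Two-Phase Multiway Merge-Sort algorithm consists of two phases:
Phase 1
1) Fill main memory with records.
2) Sort with your favorite main-memory sorting algorithm.
3) Write the sorted list to disk.
4) Repeat until all records have been put into one of the sorted lists.
Phase 2
1) Initially load the input buffers with the first block of their respective sorted lists.
2) Repeatedly run a competition among the first unchosen records of each of the buffered blocks.
3) If an input block is exhausted, get the next block from the same file.
4) If the output block is full, write it to disk.

4.2 Design approach
From the algorithm description, the system has two main modules: Phase 1 and Phase 2. Phase 1 splits the generated record file into sub-record files the size of the memory block (50 MB in this experiment, which gives 20 files), reads the records of each into memory, sorts them, and writes them back to disk. Phase 2 uses multiway merging to combine the sorted sub-record files from Phase 1 into a single sorted record file, which is written back to disk.
Because a file of 10,000,000 records of 100 bytes each must exist before Phase 1 and Phase 2 can run, the system also needs a Generate Record File module. In summary, the system has three modules: Generate Record File, Phase 1, and Phase 2. Phase 1 is subdivided into a Main Memory Sort module and a Write To Disk module. Phase 2 is subdivided into a Merge-Sort module and a Write To Disk module. The logical structure of the system is shown in Figure 3-1.
Figure 3-1 Logical structure of the TPMMS system

4.3 Data structures
This section describes the data structure of a single record. A record is 100 bytes long: 4 bytes hold the primary key, a random integer, and the other 96 bytes hold arbitrary filler data (the content). Because the system is implemented in C, a struct is the natural representation. The integer field key holds the 4-byte primary key used for sorting. The array field contents holds the remaining 96 bytes, whose content is arbitrary (all 0 in this experiment). The data structure is shown in Figure 4-1.
Figure 4-1 Data structure of a single record

4.4 Implementation
4.4.1 Generate Record File
The Generate Record File stage is simple. Open a file, generate a random key and write it to the file, then append 96 bytes of arbitrary content (all 0 in this experiment); this produces one complete record. Repeating this 10,000,000 times produces the required record file. The core code is shown in Figure 4-2, where MAX_RECORD_NUMBER is 10,000,000 and ROW_NUMBER is 95.
Figure 4-2 Implementation of the Generate Record File stage

4.4.2 Phase 1
The key point of Phase 1 is sorting in memory and writing the result back to disk. We use the STL sort function for the sorting. First, 50 MB of records are read in and sorted with sort, then written to disk as one sorted sub-record file. This is repeated 20 times, producing 20 sub-record files. The core code is shown in Figure 4-3, where BLOCK_SIZE is 50 MB (500,000 records in the code appendix) and SUB_LIST_NUMBER is 20.
Figure 4-3 Implementation of Phase 1

4.4.3 Phase 2
Phase 2 is the hardest part of the system. The implementation is discussed in the following parts:
1) Input buffer. The first record of each of the 20 sub-record files produced in Phase 1 is read into the input buffer array inputBuffer, which has length 20. The core code is shown in Figure 4-4.
Figure 4-4 Implementation of the input buffer
2) Output buffer. The entry of inputBuffer with the smallest primary key is selected and moved to the output buffer array outputBuffer; this is repeated in a loop. The core code is shown in Figure 4-5.
Figure 4-5 Implementation of the output buffer
3) Multiway merge. Once outputBuffer is full, its contents are in sorted order and no key that arrives later can be smaller than any key already in it, so it can be written to the final result file. The core code is shown in Figure 4-6.
Figure 4-6 Implementation of the multiway merge
4) Remaining parts of Phase 2. The remaining difficulties of Phase 2 are discussed in section 6, "Problems encountered and solutions".

5 Experimental results
5.1 TPMMS results with 50 MB of memory
The results of running TPMMS with a 50 MB memory block are shown in Figure 5-1.
Figure 5-1 TPMMS results with 50 MB of memory
The figure shows that generating the 1 GB file of 10,000,000 records takes 152 seconds, Phase 1 takes 136 seconds, and Phase 2 takes 150 seconds. The whole sort therefore takes 286 seconds, that is, 4 minutes 46 seconds.

5.2 TPMMS results with 10 MB of memory
We reduce the 50 MB of memory by a factor of 5 and run TPMMS with a 10 MB memory block. This produces 100 sub-record files. The results are shown in Figure 5-2.
Figure 5-2 TPMMS results with 10 MB of memory
The time to generate the 1 GB file of 10,000,000 records is unchanged, at about 152 seconds.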

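The times of interest are those of Phase 1 and Phase 2. The figure shows that Phase 1 takes 147 seconds and Phase 2 takes 152 seconds. The whole sort takes about 300 seconds, roughly 5 minutes.

5.3 TPMMS results with 100 MB of memory
We then double the 50 MB of memory and run TPMMS with a 100 MB memory block. This produces 10 sub-record files. The results are shown in Figure 5-3.
Figure 5-3 TPMMS results with 100 MB of memory
The time to generate the 1 GB file of 10,000,000 records is unchanged, at about 152 seconds. The times of interest are again those of Phase 1 and Phase 2. The figure shows that Phase 1 takes 124 seconds and Phase 2 takes 130 seconds. The whole sort takes 254 seconds, that is, 4 minutes 14 seconds.

5.4 Comparison of the three
The results show clearly that the larger the memory block, the less time the algorithm needs: 254 seconds with 100 MB, 286 seconds with 50 MB, and about 300 seconds with 10 MB. A smaller memory block produces more sub-record files, which increases the time Phase 1 spends generating them. It also makes the output buffer in Phase 2 smaller, so the program reads and writes the disk more often during the multiway merge, and the Phase 2 time increases as well. The total sorting time therefore goes up.
In summary, under ideal conditions TPMMS should be run with a larger memory block. In these experiments the 100 MB run outperforms the 50 MB run, which outperforms the 10 MB run, so the 100 MB configuration should be preferred where it is possible.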
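The relationship between memory size, number of sub-record files, and buffer size can be made concrete with a little arithmetic. The following C++ sketch is only an illustration: it assumes 1 MB = 1,000,000 bytes (which is what 500,000 records of 100 bytes for 50 MB implies), and it assumes, which the report does not state, that Phase 2 divides the memory evenly among one input buffer per sub-record file plus one output buffer. All names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Figures taken from the task description.
const std::size_t kRecordBytes  = 100;       // size of one record
const std::size_t kTotalRecords = 10000000;  // records in the input file

// Number of records that fit into the memory block used for sorting.
std::size_t recordsPerRun(std::size_t memoryBytes) {
    assert(memoryBytes >= kRecordBytes);  // memory must hold at least one record
    return memoryBytes / kRecordBytes;
}

// Number of sorted sub-record files Phase 1 produces (ceiling division).
std::size_t sublistCount(std::size_t memoryBytes) {
    const std::size_t perRun = recordsPerRun(memoryBytes);
    return (kTotalRecords + perRun - 1) / perRun;
}

// ASSUMPTION: in Phase 2 the same memory is split evenly among one input
// buffer per sub-record file plus one output buffer. Result is in records.
std::size_t recordsPerBuffer(std::size_t memoryBytes) {
    return recordsPerRun(memoryBytes) / (sublistCount(memoryBytes) + 1);
}
```

Under these assumptions, 10 MB gives 100 sub-record files with buffers of 990 records each, 50 MB gives 20 files with buffers of 23,809 records each, and 100 MB gives 10 files with buffers of 90,909 records each. Smaller buffers mean more, smaller disk transfers during the merge, which is consistent with the trend in the measured times.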

6 Problems encountered and solutions
6.1 Problems in Phase 2 and their solutions
The three main parts of Phase 2 were described above, but they are not enough on their own to implement Phase 2 completely. The following key problems must also be solved.

6.1.1 Refilling the input buffer after a sub-record file has been read to the end
When the sub-record file infp[i] that feeds the input buffer entry inputBuffer[i] has been read to the end, we search by array index for the first sub-record file infp[k] that is still available and refill inputBuffer[i] from that file. In particular, when the array index exceeds the number of sub-record files SUB_LIST_NUMBER (20 in this experiment), all sub-record files have been read completely, and we set a bool variable flag = true to mark this. The core code is shown in Figure 6-1.
Figure 6-1 Refilling the input buffer after a sub-record file has been read to the end

6.1.2 Handling the last group of input buffer data after all sub-record files have been read
We use the bool variable flag set in 6.1.1: when flag is true, all sub-record files have been read completely. At that point only the last group of data remains in the input buffer array inputBuffer, and by the definition of Phase 2 these records must be larger than the data that passed through the input buffer before them. We therefore only need to sort them with the STL sort function and write them directly to the final result file. The core code is shown in Figure 6-2.
Figure 6-2 Handling the last group of input buffer data after all sub-record files have been read

6.2 Generating sub-record file names
When generating the sub-record files, we want to give them names of the form record_k.txt (k = i+1, 0 ≤ i ≤ 19). C does not support concatenating a string and an integer directly, so we need a generateFileName function that converts the integer k = i+1 to a string with the itoa function and appends it to "record_" to obtain the file name. The core code is shown in Figure 6-3.
Figure 6-3 Generating sub-record file names

7 Code appendix

    #include <algorithm> // for sort function
    #include <cstring>   // for strcpy
    #include <cstdio>    // for fscanf, fprintf, fopen
    #include <cstdlib>   // for rand, srand, exit, itoa (itoa is a non-standard extension)
    #include <ctime>     // for clock, time
    #include <string>    // for string
    using namespace std;

    /* define the constants used in this program */
    const int MAX_RECORD_NUMBER = 10000000; // max record number
    const int BLOCK_SIZE = 500000;          // main memory block size, in records
    const int ROW_NUMBER = 95;              // for record to fill the other 96 bytes
    const int SUB_LIST_NUMBER = ( MAX_RECORD_NUMBER / BLOCK_SIZE ); // sub list number
    const int MAX = 99999999; // for function selectMinRecord to initialize the variable min

    /* the data structure of a record */
    typedef struct record
    {
        int key;                       // primary key
        char contents[ROW_NUMBER + 1]; // content
    } Record;

    Record subRecord[BLOCK_SIZE];        // main memory buffer
    Record inputBuffer[BLOCK_SIZE + 1];  // input buffer
    Record outputBuffer[BLOCK_SIZE + 1]; // output buffer

    /* generate a file of MAX_RECORD_NUMBER (= 10000000) records,
       every record is 100 bytes */
    void generateFile( string fileName )
    {
        // calculate time
        printf( "The records are now being generated ...\n" );
        clock_t start, finish;
        double duration;
        start = clock(); // start time

        // open file
        FILE *fp = fopen( fileName.c_str(), "w" );
        if ( !fp ) // open failed
        {
            printf( "File could not be created!\n" );
            fprintf( stderr, "File could not be created!\n" );
            exit( 1 );
        }

        // generate random integers and records
        srand( (unsigned)time( NULL ) ); // srand seed
        for ( int i = 0; i < MAX_RECORD_NUMBER; i++ )
        {
            if ( i > 0 )
                fprintf( fp, "\n" ); // one record per line
            int key = rand(); // primary key, random integer

            // write record to disk
            fprintf( fp, "%d ", key ); // write key first
            for ( int j = 0; j < ROW_NUMBER; j++ ) // write '0' characters as the content
                fprintf( fp, "%c", '0' );
        }
        fclose( fp ); // close output file

        // calculate time
        finish = clock(); // finish time
        duration = (double)( finish - start ) / CLOCKS_PER_SEC; // run time
        printf( "It takes %f seconds to generate the whole records.\n", duration );
    }

    /* use for phase 1 of two phase multiway merge sort
       compare two records by primary key, with ascending order */
    bool cmp( const Record &r1, const Record &r2 )
    {
        return r1.key < r2.key;
    }

    /* give an integer, to generate a file name */
    string generateFileName( int i )
    {
        char str[20];      // temporary character array
        string temp = "";  // temporary string
        itoa( i + 1, str, 10 ); // store integer i+1 to array str
        temp = str;             // convert array str to temporary string
        temp = "D:/record_" + temp + ".txt"; // form the file name
        return temp; // return the temporary string of file name
    }

    /* phase 1 of two phase multiway merge sort
       read records with maximum block size to main memory and sort them by primary key */
    void phase1( string fileName )
    {
        // open file
        FILE *infp = fopen( fileName.c_str(), "r" );
        if ( !infp ) // open failed
        {
            printf( "File %s could not be opened!\n", fileName.c_str() );
            fprintf( stderr, "File %s could not be opened!\n", fileName.c_str() );
            exit( 1 );
        }

        string temp = ""; // temporary string
        int i = 0, j = 0;

        // calculate time
        printf( "The sorted lists of records are now being generated ...\n" );
        clock_t start, finish;
        double duration;
        start = clock(); // start time

        char str[ROW_NUMBER + 10];

        // read all records to main memory
        for ( int k = 0; k < SUB_LIST_NUMBER; k++ )
        {
            // read records of a block size to main memory
            for ( i = 0; i < BLOCK_SIZE; i++ )
            {
                fgets( str, ROW_NUMBER + 10, infp );
                sscanf( str, "%d %s", &subRecord[i].key, subRecord[i].contents );
            }

            // use STL algorithm sort to sort records
            sort( subRecord, subRecord + BLOCK_SIZE, cmp );

            temp = generateFileName( k ); // sorted list name
            FILE *outfp = fopen( temp.c_str(), "w" ); // open output file
            if ( !outfp ) // open failed
            {
                printf( "File %s could not be opened!\n", temp.c_str() );
                fprintf( stderr, "File %s could not be opened!\n", temp.c_str() );
                exit( 1 );
            }

            // write the sorted records to sub list file
            for ( i = 0; i < BLOCK_SIZE; i++ )
            {
                if ( i > 0 )
                    fprintf( outfp, "\n" );
                fprintf( outfp, "%d %s", subRecord[i].key, subRecord[i].contents );
            }
            printf( "The sorted list %s generated successfully!\n", temp.c_str() );
            fclose( outfp ); // close output stream
        }
        fclose( infp ); // close input file

        // calculate time
        finish = clock(); // finish time
        duration = (double)( finish - start ) / CLOCKS_PER_SEC; // run time
        printf( "It takes %f seconds to generate the sorted list of records.\n", duration );
    }

    /* copy record r2 to record r1 */
    void copyRecord( Record &r1, Record &r2 )
    {
        r1.key = r2.key;
        strcpy( r1.contents, r2.contents );
    }

    /* copy a record to input buffer */
    void copyToInputBuf
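The listing stops before the Phase 2 code, so here is a minimal sketch of the merge step as described in section 4.1. It is not the report's implementation. It selects the smallest key with a std::priority_queue instead of the linear scan that the name selectMinRecord suggests, and it relies on the stream library's own buffering rather than explicit one-block input and output buffers. The names MergeRecord, HeapEntry, GreaterKey, readRecord, mergeRuns, and mergeText are hypothetical. The record format assumed is one "key contents" pair per line, as written by Phase 1.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <istream>
#include <ostream>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch: integer key plus filler text.
struct MergeRecord {
    int key;
    std::string contents;
};

// A buffered record together with the index of the run it was read from.
struct HeapEntry {
    MergeRecord rec;
    std::size_t run;
};

// std::priority_queue is a max-heap, so invert the comparison
// to keep the smallest key on top.
struct GreaterKey {
    bool operator()(const HeapEntry& a, const HeapEntry& b) const {
        return a.rec.key > b.rec.key;
    }
};

// Reads the next "key contents" record; returns false when the run is exhausted.
static bool readRecord(std::istream& in, MergeRecord& out) {
    return static_cast<bool>(in >> out.key >> out.contents);
}

// Merges already-sorted runs into one sorted output stream.
// Returns the number of records written.
std::size_t mergeRuns(const std::vector<std::istream*>& runs, std::ostream& out) {
    std::priority_queue<HeapEntry, std::vector<HeapEntry>, GreaterKey> heap;

    // Step 1: load the first record of every run.
    for (std::size_t i = 0; i < runs.size(); ++i) {
        assert(runs[i] != nullptr);
        MergeRecord r;
        if (readRecord(*runs[i], r)) heap.push(HeapEntry{r, i});
    }

    // Steps 2-4: repeatedly output the smallest buffered record and
    // replace it with the next record from the same run, if any.
    std::size_t written = 0;
    while (!heap.empty()) {
        HeapEntry top = heap.top();
        heap.pop();
        out << top.rec.key << ' ' << top.rec.contents << '\n';
        ++written;
        MergeRecord next;
        if (readRecord(*runs[top.run], next)) heap.push(HeapEntry{next, top.run});
    }
    return written;
}

// Convenience wrapper for trying the merge on in-memory text instead of files.
std::string mergeText(const std::vector<std::string>& runTexts) {
    std::deque<std::istringstream> streams;  // deque keeps element addresses stable
    std::vector<std::istream*> runs;
    for (std::size_t i = 0; i < runTexts.size(); ++i) {
        streams.emplace_back(runTexts[i]);
        runs.push_back(&streams.back());
    }
    std::ostringstream out;
    mergeRuns(runs, out);
    return out.str();
}
```

To merge the real sub-record files, one std::ifstream per file would be opened and its address passed to mergeRuns, with a std::ofstream as the output. Because a run that is exhausted simply stops contributing entries to the heap, this structure does not need a separate end-of-input flag or a final sort of the remaining buffered records.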
