A Model based approach for cluster traditional rice varieties of Sri Lanka

Silva, MDRL

URI: http://dl.lib.mrt.ac.lk/handle/123/12095

Abstract:

Highly developed modern techniques produce enormous volumes of data, and biologists have shown great interest in clustering biological data to detect its underlying patterns, since the biological experiments themselves have failed to correctly identify the hidden information and divergence patterns in the data. This study aims (1) to assist in clustering biologically similar sequences, and so detect the divergence patterns in rice genomic data, by developing a program that uses a model-based clustering algorithm based on the Chinese restaurant process, which was originally proposed for clustering gene expression data; and (2) to evaluate the performance of a pairwise distance matrix of rice genome sequences, calculated from the 12-dimensional natural vector of each DNA sequence, as the similarity measure in cluster analysis. The program developed from the proposed model-based clustering method was run on an AFLP profile data set containing features of 53 Sri Lankan traditional and wild rice varieties in order to identify the genetic divergence among them. Both a statistical and a biological cluster evaluation were carried out to validate the results. The statistical evaluation used the Bayes ratio to measure the tightness of the clusters formed. The biological evaluation was conducted with the help of domain experts and drew on research work done by the institute of rice of Sri Lanka. The results showed that the proposed algorithm is capable of identifying highly similar varieties of rice and showing their divergence patterns.

To assess how well the natural vector method captures the information encoded in rice genome sequences, 10 rice disease resistance genes belonging to three different protein families from the Rice Genome Annotation Project database were used. The results showed that the pairwise distance matrix calculated with the 12-dimensional natural vector method gives efficient results compared to traditional proximity matrices. They also revealed that fixed-length subsequences no longer than the shortest of the selected sequences are highly capable of capturing the information encoded in the full-length sequences, regardless of the subsequence length.
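
The abstract does not spell out how the 12-dimensional natural vector is computed, so the following is only a minimal sketch under a stated assumption: that it follows the commonly used definition in which each of the four nucleotides contributes its count, its mean position, and its normalized second central moment of position. The function names `natural_vector` and `distance_matrix`, the 1-based positions, and the choice of Euclidean distance are illustrative assumptions, not details taken from the thesis.

```python
import math

BASES = "ACGT"

def natural_vector(seq):
    """Return an assumed 12-dimensional natural vector of a DNA sequence.

    Layout: 4 counts, then 4 mean positions, then 4 normalized second
    central moments, each in the order A, C, G, T.
    """
    seq = seq.upper()
    n = len(seq)  # total length; symbols other than A/C/G/T count here only
    counts, means, moments = [], [], []
    for base in BASES:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Absent nucleotide: all three of its components are set to zero
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments

def distance_matrix(seqs):
    """Pairwise Euclidean distances between the natural vectors of seqs."""
    vectors = [natural_vector(s) for s in seqs]
    return [[math.dist(u, v) for v in vectors] for u in vectors]

if __name__ == "__main__":
    # Toy sequences for illustration only, not data from the study
    for row in distance_matrix(["ACGTACGT", "AACCGGTT", "ACGTTGCA"]):
        print(["%.3f" % d for d in row])
```

Because every sequence maps to a vector of the same length, sequences of different lengths can be compared without alignment, which is what makes such a vector usable as input to a distance-based cluster analysis.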
