Abstract:
Highly developed modern techniques produce an enormous volume of data, and
clustering biological data has therefore attracted great interest among
biologists as a way to detect the underlying patterns in the data, since
biological experiments alone often fail to correctly identify the hidden
information and divergence patterns that exist in it.
This study aims to (1) assist in clustering biologically similar sequences to
detect divergence patterns in rice genomic data, by developing a program that
uses the model-based clustering algorithm based on the Chinese restaurant
process, which was originally proposed for clustering gene expression data, and
(2) evaluate the performance of the pairwise distance matrix of rice genome
sequences, calculated from the 12-dimensional natural vector of each DNA
sequence, as the similarity measure in cluster analysis.
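As a rough illustration of the Chinese restaurant process mentioned above, the following minimal Python sketch draws a random partition from the process prior only. It is not the program developed in this study: the full model-based method also weighs how well each item fits a cluster's data model, which is omitted here. The function name `crp_partition` and the concentration parameter `alpha` are assumptions introduced for illustration.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Draw cluster labels for n_items from a Chinese restaurant process prior.

    Item i joins an existing cluster with probability proportional to that
    cluster's current size, or opens a new cluster with probability
    proportional to alpha (hypothetical concentration parameter, alpha > 0).
    """
    assignments = []   # cluster label of each item, in arrival order
    cluster_sizes = [] # number of items currently in each cluster
    for _ in range(n_items):
        # The last weight corresponds to opening a new cluster.
        weights = cluster_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[choice] += 1
        assignments.append(choice)
    return assignments


if __name__ == "__main__":
    # 53 items, matching the number of rice varieties in the data set.
    labels = crp_partition(53, alpha=1.0, rng=random.Random(0))
    print(len(set(labels)), "clusters drawn from the prior")
```

The number of clusters is not fixed in advance; larger values of `alpha` tend to produce more clusters.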
The program developed from the proposed model-based clustering method was
run on an AFLP profile data set containing features of 53 Sri Lankan traditional
and wild rice varieties in order to identify the genetic divergence among them.
Both a statistical and a biological cluster evaluation were carried out to validate
the results obtained. Statistical evaluation was done based on the Bayes ratio to
measure the tightness of the clusters formed. Biological evaluation was conducted
with the help of domain experts and research work done by the institute of
rice of Sri Lanka. The results showed that the proposed algorithm can identify
highly similar rice varieties and reveal their divergence patterns.
To assess how well the natural vector method captures the information encoded
in rice genome sequences, 10 rice disease resistance genes belonging to three
different protein families from the Rice Genome Annotation Project database
were used. The results showed that the pairwise distance matrix calculated with
the 12-dimensional natural vector method gives efficient results compared to
traditional proximity matrices. They also revealed that fixed-length
subsequences no longer than the shortest of the selected sequences are highly
capable of capturing the information encoded in the full-length sequences,
regardless of the subsequence length.
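The 12-dimensional natural vector can be sketched as follows. This assumes the commonly used definition: for each nucleotide, its count, the mean of its positions, and the second central moment of its positions normalized by the count and the sequence length. The use of Euclidean distance between vectors and the function names are assumptions for illustration, not details confirmed by this abstract.

```python
import math

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return the 12-dimensional natural vector of a DNA sequence.

    Layout: (n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T), where
    n_k is the count of nucleotide k, mu_k the mean of its 1-based positions,
    and D2_k = sum((pos - mu_k)^2) / (n_k * n) with n the sequence length.
    Nucleotides absent from the sequence contribute zeros (an assumption).
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def distance_matrix(sequences):
    """Pairwise Euclidean distances between natural vectors (assumed metric)."""
    vectors = [natural_vector(s) for s in sequences]
    return [[math.dist(u, v) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Toy sequences, not real rice genes.
    for row in distance_matrix(["ACGTACGT", "ACGTTGCA", "AAAACCCC"]):
        print([round(d, 3) for d in row])
```

Because every sequence maps to a vector of the same length, sequences of different lengths can be compared without alignment.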