A heterogeneous data ensemble approach for the classification of Saccharomyces cerevisiae proteins under ‘mitochondrion organization’

Abstract:
Proteins are the key players in keeping a cell healthy and functioning well. An important group is the subset of mitochondrial proteins that engage in the assembly, arrangement and disassembly of the mitochondrion. Several of them have been identified as causes of human diseases. Hence, annotating proteins under the ‘mitochondrion organization’ biological process is vital for identifying disease-causing factors and for designing therapeutics. As manual annotation requires costly and laborious in vitro methods, in silico function prediction is now preferred. Recent studies highlight the importance of incorporating data from various biological aspects to formulate a strong functional context for classification. In addition, many approaches in the literature employ ensemble classifiers to attain higher prediction accuracy. However, an approach that combines accurate classification, effective use of biological data, and determination of the significance of each biological data type is still needed. This study presents an assessment of a heterogeneous data ensemble to classify Saccharomyces cerevisiae proteins under ‘mitochondrion organization’. The ensemble consists of nine Euclidean-distance-based nearest neighbour models and three affinity-based neighbourhood models; it utilizes sequences, protein domains, peptide chain properties, gene expression, secondary structure and interactions. The base models were trained on annotations from the Gene Ontology, as well as on a publicly available benchmark gold dataset. They show a substantial level of disagreement, suggesting that they can be effective in collective decision making. Six combination schemes were evaluated for fusing the base model outputs.
A genetic-algorithm-weighted ensemble gives the largest improvement over the best-performing base classifier, achieving an average area under the Receiver Operating Characteristic curve of 92.52%. Moreover, it is capable of determining the biological importance of each data type. Overall, the proposed heterogeneous data ensemble identifies eight disease-related proteins in a strong sense and one disease-related protein in a moderate sense.
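
As an illustration of the weighted fusion idea, the following minimal sketch searches for base model weights with a simple genetic algorithm and scores each candidate by the area under the ROC curve of the fused output. It is not the study's implementation: the abstract does not specify the genetic operators, the fitness function or the data, so the function names (`auc`, `fuse`, `evolve_weights`), the selection, crossover and mutation settings, and the toy scores and labels are all assumptions made for this example.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fuse(model_scores, weights):
    """Weighted average of the base model scores for each protein."""
    total = sum(weights)
    n_items = len(model_scores[0])
    return [sum(w * m[i] for w, m in zip(weights, model_scores)) / total
            for i in range(n_items)]

def evolve_weights(model_scores, labels, pop_size=20, generations=30, seed=0):
    """Search for fusion weights with a small, illustrative genetic algorithm."""
    rng = random.Random(seed)
    k = len(model_scores)

    def fitness(weights):
        return auc(fuse(model_scores, weights), labels)

    # Seed the population with each base model alone (one-hot weights),
    # then fill up with random positive weight vectors.
    population = [[1.0 if i == j else 0.0 for i in range(k)] for j in range(k)]
    while len(population) < pop_size:
        population.append([rng.random() + 1e-9 for _ in range(k)])

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elitism: best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]   # uniform crossover
            child = [max(0.0, w + rng.gauss(0, 0.1)) for w in child]  # mutation
            if sum(child) > 0:  # skip degenerate all-zero weight vectors
                children.append(child)
        population = parents + children

    return max(population, key=fitness)

# Hypothetical toy data: three base models scoring eight proteins.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
model_scores = [
    [0.9, 0.4, 0.8, 0.3, 0.6, 0.2, 0.7, 0.5],
    [0.6, 0.9, 0.3, 0.4, 0.2, 0.7, 0.8, 0.1],
    [0.5, 0.7, 0.6, 0.6, 0.3, 0.4, 0.4, 0.2],
]

weights = evolve_weights(model_scores, labels)
print("base AUCs:", [round(auc(m, labels), 3) for m in model_scores])
print("fused AUC:", round(auc(fuse(model_scores, weights), labels), 3))
print("weights  :", [round(w, 3) for w in weights])
```

Because the initial population contains each base model on its own and the best half always survives, the fused AUC on the data used for the search cannot fall below that of the best single model. The learned weights can also be read as a rough indication of how much each data type contributes, although a fair estimate of performance would require held-out data rather than the data used to fit the weights.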