Abstract:
Gene expression data analysis is a major area in the interpretation of biological systems. Because gene expression data have large numbers of variables, high-dimensional clustering methods are required for analysis. The objectives of this study were to assess the effectiveness of different clustering methods in gene expression data analysis based on biological relatedness, and to examine the advantages and disadvantages of different clustering strategies in gene expression analysis. Data were obtained from the GSE19830 dataset and the brain tumor data of the TCGA project. The K-means algorithm, HClust, and topic modeling were used to test hard clustering, hierarchical clustering, and fuzzy clustering, respectively. Prior knowledge about the dataset was required to define the number of clusters (K). First, the GSE19830 (brain, lung, and liver tissue mixture) dataset was used to develop the clusters. All models clustered the observations in a manner similar to the physical tags in the dataset. Second, clustering models were developed with the brain tumor dataset, which consists of 202 samples from four specified, physically categorized tumor types. According to hierarchical clustering and topic modeling, when similar tissues were analyzed, the gene expression tumor subtypes (clusters) did not align with the physical categorization. Finally, 81 cancer genes were filtered and used to generate a topic model. To understand the biological relevance of the final model, the Reactome and PCViz tools were used. The Reactome results supported the topics produced by topic modeling. According to the results, topic modeling was found to be a promising approach for gene expression based clustering in high-dimensional data analysis, while K-means was found to be inappropriate for gene clustering.
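The contrast between hard and fuzzy clustering on a tissue-mixture sample can be illustrated with a minimal sketch. It is not the study's pipeline and it is not topic modeling: a fuzzy c-means style membership formula is swapped in only to show what a soft assignment looks like next to a K-means style nearest-center assignment. The three-gene signatures, the 60/40 mixture, and the function names are hypothetical values chosen for illustration.

```python
import math

# Hypothetical toy expression signatures (3 genes) for three pure tissues.
# These numbers are illustrative assumptions, not values from GSE19830.
signatures = {
    "brain": [9.0, 1.0, 1.0],
    "lung": [1.0, 9.0, 1.0],
    "liver": [1.0, 1.0, 9.0],
}


def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_assign(sample, centers):
    """K-means style assignment: the single nearest center wins."""
    return min(centers, key=lambda name: distance(sample, centers[name]))


def soft_assign(sample, centers, m=2.0):
    """Fuzzy c-means style memberships that sum to 1 across clusters.

    m > 1 is the fuzzifier; larger values give flatter memberships.
    """
    dists = {name: distance(sample, c) for name, c in centers.items()}
    # A sample lying exactly on a center belongs fully to that cluster.
    for name, d in dists.items():
        if d == 0.0:
            return {n: (1.0 if n == name else 0.0) for n in dists}
    power = 2.0 / (m - 1.0)
    inverse = {name: d ** -power for name, d in dists.items()}
    total = sum(inverse.values())
    return {name: v / total for name, v in inverse.items()}


# A hypothetical sample mixed as 60% brain and 40% liver.
mixed = [
    0.6 * b + 0.4 * l
    for b, l in zip(signatures["brain"], signatures["liver"])
]

hard_label = hard_assign(mixed, signatures)
memberships = soft_assign(mixed, signatures)

print("hard label:", hard_label)
for tissue, weight in memberships.items():
    print(f"{tissue}: {weight:.3f}")
```

The hard assignment reports a single tissue and discards the liver contribution, whereas the soft memberships keep a nonzero weight for every tissue. A topic model plays a comparable role in that each sample receives a distribution over topics rather than one label, although its proportions come from a probabilistic model rather than from distances.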
Citation:
S. P. B. M. Senadheera and A. R. Weerasinghe, "Usage of Topic Modeling Method for High Dimensional Gene Expression Data Analysis," 2021 6th International Conference on Information Technology Research (ICITR), 2021, pp. 1-6, doi: 10.1109/ICITR54349.2021.9657380.