Identifying new cancer genes from large-scale genome screens of gene mutations in cancers is of great importance. Because alterations of certain essential functions are indispensable for oncogenesis, we define these as cancer functions and approximate them by a group of detailed GO (Gene Ontology) functions that are highly enriched with known cancer genes. To evaluate how efficiently cancer functions serve as features for identifying cancer genes, we define, among the screened genes, the known protein kinase cancer genes as gold-standard positives and the other kinase genes as gold-standard negatives. The results show that cancer-associated functions identify cancer genes more efficiently than the selection pressure feature. Furthermore, combining cancer functions with the number of non-silent mutations yields more reliable positive predictions. Finally, at a precision of 0.42, we suggest a list of 46 kinase genes as candidate cancer genes, each of which is annotated to cancer functions and carries at least 3 non-silent mutations.
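The combined selection rule and its evaluation against a gold standard can be illustrated with a minimal sketch. It assumes a simple per-gene record; the `Gene` class, its field names, the helper functions, and the toy kinase entries are all hypothetical and are not taken from the paper's data or software. Only the rule itself (annotated to a cancer function and carrying at least 3 non-silent mutations) and the definition of precision come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gene:
    # All fields are illustrative assumptions, not the paper's data model.
    name: str
    in_cancer_function: bool    # annotated to at least one cancer function
    non_silent_mutations: int   # number of non-silent mutations observed
    known_cancer_gene: bool     # gold-standard label (positive if True)


def is_candidate(gene: Gene, min_mutations: int = 3) -> bool:
    """Combined rule: cancer-function annotation AND enough non-silent mutations."""
    return gene.in_cancer_function and gene.non_silent_mutations >= min_mutations


def precision(genes: List[Gene], min_mutations: int = 3) -> Optional[float]:
    """Fraction of predicted candidates that are gold-standard positives."""
    predicted = [g for g in genes if is_candidate(g, min_mutations)]
    if not predicted:
        return None  # precision is undefined when nothing is predicted
    true_positives = sum(g.known_cancer_gene for g in predicted)
    return true_positives / len(predicted)


# Toy kinase genes (made-up names and values, for illustration only).
genes = [
    Gene("K1", True, 5, True),
    Gene("K2", True, 3, False),
    Gene("K3", True, 1, True),    # too few mutations, not predicted
    Gene("K4", False, 7, False),  # no cancer-function annotation, not predicted
    Gene("K5", True, 4, True),
    Gene("K6", False, 0, False),
]

candidates = [g.name for g in genes if is_candidate(g)]
print(candidates, precision(genes))
```

In this toy set the rule selects K1, K2 and K5, of which two are gold-standard positives. Raising `min_mutations` trades the number of candidates against precision, which is the kind of trade-off the abstract describes.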
LI YanHui1, GUO Zheng1,2, PENG ChunFang2, LIU Qing2, MA WenCai2, WANG Jing2, YAO Chen2, ZHANG Min2 & ZHU Jing1 1 Bioinformatics Centre, School of Life Science, University of Electronic Science and Technology of China, Chengdu 610054, China
Numerous algorithms have been designed to predict the functions of novel proteins from high-throughput data. However, the effectiveness of such algorithms is currently limited by several fundamental factors, including (1) the low a priori probability that a novel protein participates in a given detailed function; (2) the large amount of false data present in high-throughput datasets; (3) the incomplete data coverage of functional classes; (4) the abundant but heterogeneous negative samples for training the algorithms; and (5) the lack of detailed functional knowledge for training algorithms. Here, for partially characterized proteins, we suggest an approach to finding their finer functions based on protein interaction sub-networks or gene expression patterns defined in function-specific subspaces. The proposed approach mitigates the above-mentioned problems by properly defining the prediction range and functionally filtering the noisy data, and can thus efficiently find novel functions of proteins. For thousands of partially characterized yeast and human proteins, it reliably finds their finer functions (e.g., the translational functions) with more than 90% precision. The predicted finer functions are highly valuable both for guiding follow-up wet-lab validation and for providing the data needed to train algorithms that annotate other proteins.
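The idea of restricting prediction to a function-specific sub-network can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the scoring scheme, so a plain neighbor majority vote is swapped in here, and the function name `predict_finer_function`, the `min_support` threshold, and the toy protein identifiers are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple


def predict_finer_function(
    protein: str,
    interactions: Iterable[Tuple[str, str]],
    parent_members: Set[str],
    finer_annotations: Dict[str, str],
    min_support: int = 2,
) -> Optional[str]:
    """Majority vote over neighbors inside a function-specific sub-network.

    Assumption (not from the paper): an interaction is kept only if both
    partners are annotated to the same broad parent function, which filters
    out interactions that leave the function-specific subspace.
    """
    neighbors = set()
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        if a in parent_members and b in parent_members:
            if a == protein:
                neighbors.add(b)
            elif b == protein:
                neighbors.add(a)

    # Count the finer functions of neighbors that already have one.
    votes = Counter(finer_annotations[n] for n in neighbors if n in finer_annotations)
    if not votes:
        return None
    function, count = votes.most_common(1)[0]
    return function if count >= min_support else None


# Toy example: X is known to belong to the broad parent function only.
parent_members = {"X", "P1", "P2", "P3", "P4"}
finer_annotations = {
    "P1": "translational initiation",
    "P2": "translational initiation",
    "P3": "translational elongation",
    "Q9": "DNA repair",  # outside the parent function
}
interactions = [("X", "P1"), ("P2", "X"), ("X", "P3"), ("X", "Q9"), ("P4", "P1")]

prediction = predict_finer_function("X", interactions, parent_members, finer_annotations)
print(prediction)
```

Here the interaction with Q9 is discarded because Q9 lies outside the parent function, and X is assigned the finer function supported by two of its remaining neighbors. Requiring a minimum number of supporting neighbors is one simple way to favor precision over coverage.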