Integrating pathway and protein-protein interaction (PPI) data can provide more information and may lead to new biological insights. PPIs are usually represented by a simple binary model, whereas pathways are represented by more complicated models. We developed a series of rules for transforming the protein interactions in pathways into the binary model, and applied these rules to the protein interactions from seven pathway databases: PID, BioCarta, Reactome, NetPath, INOH, SPIKE and KEGG. These pathway-derived binary protein interactions were integrated with PPIs from five PPI databases (HPRD, IntAct, BioGRID, MINT and DIP) to build an integrated dataset named PathPPI. PathPPI preserves more detailed interaction type and modification information on protein interactions than other existing datasets. Comparative analysis indicates that most of the interaction overlap values (OAB) among these pathway databases were less than 5%, so these databases need to be used together. The PathPPI data are available at http://proteomeview.hupo.org.cn/PathPPI/PathPPI.html.
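As an illustration of the two ideas in this abstract, the following minimal Python sketch expands a pathway complex into binary interactions and computes an overlap percentage between two interaction sets. The abstract does not state the transformation rules or the formula for the overlap value, so both are assumptions here: the expansion pairs every member of a complex with every other member, and the overlap is the shared interactions as a percentage of the union. The function names and protein identifiers are hypothetical.

```python
from itertools import combinations


def complex_to_binary(members):
    """Expand a complex into unordered protein pairs.

    Assumed rule (all-against-all expansion); the actual PathPPI rules
    are not given in the abstract.
    """
    return {frozenset(pair) for pair in combinations(sorted(set(members)), 2)}


def overlap_percent(interactions_a, interactions_b):
    """Shared interactions as a percentage of the union of both sets.

    Assumed definition; the abstract does not define the overlap value.
    """
    union = interactions_a | interactions_b
    if not union:
        return 0.0
    return 100.0 * len(interactions_a & interactions_b) / len(union)


# Hypothetical toy data: one three-protein complex versus two binary PPIs.
db_a = complex_to_binary(["P1", "P2", "P3"])
db_b = {frozenset(("P1", "P2")), frozenset(("P4", "P5"))}
print(overlap_percent(db_a, db_b))
```

Storing each interaction as a `frozenset` makes the pair unordered, so A-B and B-A count as the same interaction when sets from different databases are compared.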
The discovery of novel cancer genes is one of the main goals in cancer research. Bioinformatics methods can be used to accelerate cancer gene discovery, which may help in the understanding of cancer and the development of drug targets. In this paper, we describe a classifier that we developed to predict potential cancer genes by integrating multiple types of biological evidence, including protein-protein interaction network properties and sequence and functional features. We detected 55 features that were significantly different between cancer genes and non-cancer genes. Fourteen cancer-associated features were chosen to train the classifier. Four machine learning methods (logistic regression, support vector machines (SVMs), BayesNet and decision tree) were explored in the classifier models to distinguish cancer genes from non-cancer genes. The predictive power of the different models was evaluated by 5-fold cross-validation. The area under the receiver operating characteristic curve for the logistic regression, SVM, BayesNet and J48 decision tree models was 0.834, 0.740, 0.800 and 0.782, respectively. Finally, the logistic regression classifier with multiple biological features was applied to the genes in the Entrez database, and 1976 cancer gene candidates were identified. We found that the integrated prediction model performed much better than the models based on individual types of biological evidence, and that the network and functional features had stronger predictive power than the sequence features.
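The evaluation described in this abstract (area under the ROC curve, estimated with 5-fold cross-validation) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' pipeline: the fold assignment by index is an assumption, the function names are hypothetical, and the labels and scores are made-up toy values.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored above a randomly chosen negative one; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists; sample i is tested in fold i % k.

    Assumed split scheme: no shuffling and no stratification.
    """
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


# Toy values: 1 = cancer gene, 0 = non-cancer gene (hypothetical).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
print(auc(labels, scores))
```

In practice a classifier would be fitted on each `train` split and scored on the matching `test` split, with the AUC computed on the held-out scores. With few positive examples, a stratified split that keeps the class ratio in every fold is usually preferable to the simple index-based split shown here.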
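Online liquid chromatography separation plays an important role in mass spectrometry-based large-scale protein identification. Chromatographic retention time (RT) is important information for peptide identification and quantification. Because the organic phase in the mobile phase follows a nonlinear concentration curve over the chromatographic run, and because peptides in the sample influence one another, among other factors, sequence-based RT prediction still suffers from limited accuracy and poor model generalization. This paper proposes an RT prediction method based on a serial and parallel support vector machine (SP-SVM), which can characterize the nonlinear change in organic-phase concentration during elution and the interactions among peptides, and which significantly improves the accuracy of peptide retention time prediction. Validation on complex sample datasets shows that the coefficient of determination between predicted and experimental RT reached 0.95; for more than 95% of identified peptides the RT prediction error range was less than 20% of the total run time, and for more than 70% of identified peptides it was less than 10% of the total run time. The performance of the proposed model reaches the best level currently known.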
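The two evaluation measures reported in this abstract can be sketched in plain Python as follows. This is a hedged illustration only: the abstract does not give the formulas, so the coefficient of determination is assumed to be 1 - SS_res/SS_tot and the error criterion is assumed to be an absolute error below a given fraction of the total run time. The function names and retention times are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fraction_within(observed, predicted, total_run_time, tolerance):
    """Fraction of peptides whose absolute RT error is below
    tolerance * total_run_time (assumed reading of the error criterion)."""
    limit = tolerance * total_run_time
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) < limit)
    return hits / len(observed)


# Made-up retention times in minutes for four peptides, 100-minute run.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(r_squared(observed, predicted))
print(fraction_within(observed, predicted, total_run_time=100.0, tolerance=0.025))
```

Some studies report the squared Pearson correlation instead of 1 - SS_res/SS_tot; the two agree only when the predictions are an unbiased least-squares fit, so the choice should be stated when comparing models.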