Supplementary information for "Incorporating Functional Inter-relationships Into Protein Function Prediction Algorithms"

Incorporating Functional Inter-relationships Into Protein Function Prediction Algorithms (BMC Bioinformatics, 10:142)

Please contact Gaurav Pandey at gaurav@cs.umn.edu in case of any problems with the information provided below.

Matlab source code (includes a readme file detailing how to use the package)

Input data files (Required)

GO structure data: Contains the term ids and the parent-child and ancestor-descendant relationships constituting each ontology in GO, stored as Matlab matrices. Also contains text files with information about the GO Biological Process (BP) ontology and the set of GO BP classes selected for evaluation in this paper. Note that all contents of this archive must be placed in the same directory as the scripts in the source code package above. Derived from data downloaded from the GO website (www.geneontology.org) in February 2008.
GO annotations for yeast proteins: Functional annotations of yeast proteins with terms from each of the three ontologies in GO, stored as Matlab matrices. Contents include the most specific term annotations obtained from the GOA database (go_*_annotations; *=cc,mf,bp) and the annotations obtained by propagating these to the ancestors of each term (extended_go_*_annotations; *=cc,mf,bp). Also includes the annotations restricted to the GO BP terms used in the paper (selected_go_bp_annotations). Note that this file must be placed in the same directory as the scripts in the source code package above. Derived from data downloaded from the GO website (www.geneontology.org) in February 2008.
Sample data set (Large file!): Formatted input data derived from the microarray data set of Mnaimneh et al. (2004), stored as Matlab matrices. Contents include the gene names (genelist), the gene expression data matrix (data: 6307 genes x 215 conditions) and the selected GO BP annotations (final_annotation_matrix: 6307 genes x 138 classes). Also includes the data used for the cross-validation experiments in the paper, namely the gene names (useful_gene_names), the gene expression data (useful_genex_matrix), the corresponding pairwise correlation matrix (genegenecorrmatrix) and the annotation matrix (useful_label_matrix). In addition, the archive contains the gene-gene correlation matrix computed over the genes in the data set that are also annotated in GO (fullgenegenecorrmatrix), the indices of the training genes (trainingindices; the same genes as those used for cross-validation) and the indices of the test genes (testindices; genes not annotated with any of the 138 classes used in the paper). Finally, the AUC scores of the base k-NN classifiers (avg_basecase_auc_perclass) and of the label similarity-incorporated classifiers, using only the 138 classes (avg_filtersim_auc_perclass) and using all terms in the GO BP ontology (avg_filtersim_auc_perclass_allgo), all computed with k=20, are provided for verification purposes. Note that results are produced for only 137 classes, because a class must have at least ten members to be considered for the cross-validation tests and the prediction of test examples.
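The relationship between the most specific annotations and the extended (ancestor-propagated) annotations described above, together with the minimum class size requirement, can be illustrated with a small sketch. This is not the code in the Matlab package: it is a plain Python illustration on a toy ontology, and the names propagate_annotations, select_classes, the term ids T1-T3 and the gene names are all hypothetical. It assumes the ancestor sets are already fully expanded, as in the ancestor-descendant matrices.

```python
def propagate_annotations(direct, ancestors):
    """Extend each protein's most specific terms with all of their ancestors.

    direct:    dict mapping protein -> set of most specific terms
    ancestors: dict mapping term -> set of ALL its ancestor terms
               (assumed to be transitively closed already)
    """
    extended = {}
    for protein, terms in direct.items():
        full = set(terms)
        for term in terms:
            full |= ancestors.get(term, set())
        extended[protein] = full
    return extended


def select_classes(extended, min_members):
    """Return the terms annotated to at least min_members proteins."""
    counts = {}
    for terms in extended.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term for term, n in counts.items() if n >= min_members}


# Toy ontology (hypothetical): T3 is a child of T2, which is a child of T1.
ancestors = {"T3": {"T2", "T1"}, "T2": {"T1"}, "T1": set()}
direct = {"geneA": {"T3"}, "geneB": {"T2"}, "geneC": set()}

extended = propagate_annotations(direct, ancestors)
# The threshold is 2 here only because the toy data is tiny;
# the paper requires at least ten members per class.
kept = select_classes(extended, min_members=2)
```

In this toy example geneA ends up annotated with T1, T2 and T3, and only T1 and T2 meet the size threshold.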

Additional files

Additional file 1: Details of the 138 functional classes from the GO Biological Process ontology. For each of the genomic data sets used in this study, the subset of these classes having at least 10 members in that data set is used for evaluation.
Additional file 2: This figure shows where the functional classes that help improve the AUC score of the GO:0051049 (regulation of transport) class (listed in Table 4) are located in the GO Biological Process ontology. Their structural proximity to the target class (GO:0051049) suggests their potential to help improve the predictions for this class.
Additional file 3: This figure compares the performance of our functional similarity-incorporated k-NN classifiers with that of individual GEST classifiers on the data set of Mnaimneh et al.
Additional file 4: A detailed list of ranked predictions produced by the label similarity-incorporated k-NN classifiers (first worksheet) and the base k-NN classifiers (second worksheet) for the test genes extracted from the Mnaimneh gene expression data set. The GO terms, arranged in columns, are sorted from left to right in order of decreasing AUC improvement obtained by incorporating functional relationships into the base classifier for the term. The genes in each column are ranked in descending order of the score assigned by the corresponding k-NN classifier. Genes with the same score (mostly cases where the score is 0) are sorted by their ORF name.
Additional file 5: A detailed list of ranked predictions produced by the label similarity-incorporated k-NN classifiers (first worksheet) and the base k-NN classifiers (second worksheet) for the test genes extracted from the Rosetta gene expression data set. The GO terms, arranged in columns, are sorted from left to right in order of decreasing AUC improvement obtained by incorporating functional relationships into the base classifier for the term. The genes in each column are ranked in descending order of the score assigned by the corresponding k-NN classifier. Genes with the same score (mostly cases where the score is 0) are sorted by their ORF name.
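The ordering of genes within each column of Additional files 4 and 5 (descending score, ties broken by ORF name) can be reproduced with a two-part sort key. The following is a minimal Python sketch rather than the code used to generate the worksheets; the function name rank_genes and the scores are made up for illustration, and it assumes that tied genes are listed in ascending alphabetical order of ORF name.

```python
def rank_genes(scores):
    """Order ORF names by descending score, breaking ties by ORF name.

    scores: dict mapping ORF name -> classifier score for one GO term
    """
    # Negating the score gives descending order; the ORF name is the
    # secondary key (ascending order is an assumption of this sketch).
    return sorted(scores, key=lambda orf: (-scores[orf], orf))


# Made-up scores for a single GO term column.
scores = {"YBR001C": 0.0, "YAL001C": 0.0, "YCL010W": 0.75, "YDR002W": 0.4}
ranked = rank_genes(scores)
```

Here the two genes with a score of 0 appear last, with YAL001C listed before YBR001C.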