Matlab source code (Includes readme file detailing how to use the package)
Input data files (Required)
GO structure data:
Contains term IDs and the parent-child and ancestor-descendant relationships
constituting each ontology in GO as Matlab matrices, along with text files describing the GO Biological Process ontology and the set of GO BP classes selected for evaluation in this paper. Note that all contents of this archive must be placed in the same directory as the scripts in the source code package above. This archive is derived from data
downloaded from the GO website (www.geneontology.org) in February 2008.
GO annotations for yeast proteins:
Functional annotations of yeast proteins with terms in each of the
three ontologies in GO, stored as Matlab matrices. Contents include both
the most specific term annotations obtained from the GOA database
(go_*_annotations; *=cc,mf,bp) and those obtained by
propagating these annotations to the ancestors (extended_go_*_annotations; *=cc,mf,bp). Also includes the selected annotations with the GO BP terms used (selected_go_bp_annotations). Note that this file must be placed in the same directory as the scripts in the source code package above. This file is derived from data downloaded from the GO website (www.geneontology.org) in February 2008.
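The propagation step mentioned above (from most specific annotations to extended annotations) can be illustrated with a minimal Python sketch. This is not the code from the Matlab package: the function name, the dictionary-based data layout, and the term and gene names are all hypothetical, chosen only to show the idea that a gene annotated with a term is also treated as annotated with every ancestor of that term.

```python
def propagate_annotations(direct, ancestors):
    """Extend each gene's annotations with all ancestors of its terms.

    direct:    dict mapping gene name -> set of most specific terms
    ancestors: dict mapping term -> set of all its ancestor terms
    (Both layouts are assumptions made for this sketch.)
    """
    extended = {}
    for gene, terms in direct.items():
        full = set(terms)
        for term in terms:
            # Terms without an entry (e.g. a root term) have no ancestors.
            full |= ancestors.get(term, set())
        extended[gene] = full
    return extended


# Hypothetical three-level hierarchy: root <- mid <- leaf
ancestors = {"leaf": {"mid", "root"}, "mid": {"root"}}
direct = {"geneA": {"leaf"}, "geneB": {"mid"}, "geneC": set()}
extended = propagate_annotations(direct, ancestors)
```

In this toy example geneA, annotated only with the leaf term, ends up annotated with all three terms, while the unannotated geneC stays unannotated.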
Sample data set (Large file!):
Formatted input data derived from the microarray data set of Mnaimneh
et al. (2004), as Matlab matrices. Contents include the gene names
(genelist), gene expression data matrix (data: 6307 genes × 215
conditions) and the selected GO BP annotations (final_annotation_matrix:
6307 genes × 138 classes). Also includes the relevant data used for the
cross-validation experiments in the paper, namely gene names
(useful_gene_names), gene expression data (useful_genex_matrix),
corresponding pairwise correlation matrix (genegenecorrmatrix) and
annotation matrix (useful_label_matrix). Also included are the
gene-gene correlation matrix derived from those genes in the data set
that are also annotated in GO (fullgenegenecorrmatrix) and the indices of the
training genes (trainingindices; same as genes used for
cross-validation) and test genes (testindices; genes not annotated with
any of the 138 classes used in the paper). Finally, the AUC scores of
the base k-NN classifiers (avg_basecase_auc_perclass), and those of the
label similarity-incorporated classifiers, using only the 138 classes
(avg_filtersim_auc_perclass) and using all terms in the GO BP ontology (avg_filtersim_auc_perclass_allgo), all using k=20, are also provided for verification purposes. Note that results are produced for only 137 of the 138 classes, because a class must have at least ten members to be considered for cross-validation tests and prediction of test examples.
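For readers who want to sanity-check the provided AUC scores, the quantities described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method implemented in the Matlab package: the function names are hypothetical, the base k-NN score is assumed here to be the fraction of the k most correlated genes that carry the class label, and the AUC is computed directly from its pairwise definition.

```python
def eligible_classes(label_matrix, min_members=10):
    """Indices of classes (columns) with at least min_members annotated genes."""
    n_classes = len(label_matrix[0]) if label_matrix else 0
    return [c for c in range(n_classes)
            if sum(row[c] for row in label_matrix) >= min_members]


def knn_score(corr_row, labels, k, exclude=None):
    """Assumed base score: fraction of the k most correlated genes with the label.

    corr_row: correlations between the query gene and every gene
    labels:   0/1 class membership for every gene
    exclude:  index of the query gene itself, if it is in corr_row
    """
    candidates = [j for j in range(len(corr_row)) if j != exclude]
    neighbours = sorted(candidates, key=lambda j: corr_row[j], reverse=True)[:k]
    return sum(labels[j] for j in neighbours) / len(neighbours)


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data only; the paper uses k=20 and a threshold of ten members.
kept = eligible_classes([[1, 0], [1, 1], [1, 0]], min_members=2)
score = knn_score([1.0, 0.9, 0.2, 0.8], [1, 1, 0, 0], k=2, exclude=0)
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise AUC is quadratic in the number of genes, which is acceptable for a spot check of a single class but slower than a rank-based computation over all classes.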
Additional files
Additional file 1:
Details of the 138 functional classes from the GO Biological Process ontology. Subsets of these classes (those having at least 10 members in the corresponding data set) are used for evaluation on several genomic data sets in this study.
Additional file 2:
This figure shows the arrangement, in the GO biological process
ontology, of the functional classes that help improve the AUC score of
the GO:0051049 (regulation of transport) class (listed in Table 4).
Their structural proximity to the target class (GO:0051049)
suggests their potential to help improve the predictions for this class.
Additional file 3:
This figure compares the performance of our functional similarity-incorporated k-NN classifiers with that of individual GEST classifiers on Mnaimneh et al.'s data set.
Additional file 4:
A detailed list of ranked predictions produced by the label
similarity-incorporated k-NN classifiers (first worksheet) and base k-NN
classifiers (second worksheet) for the test genes extracted from the
Mnaimneh gene expression data set. The GO terms, arranged in columns,
are sorted from left to right in order of decreasing AUC improvement
obtained by incorporating functional relationships into the base
classifier for the term. The genes in each column are ranked in
descending order of the score assigned by the corresponding k-NN
classifier. Genes with the same score (mostly when the score is 0) are
sorted by their ORF name.
Additional file 5:
A detailed list of ranked predictions produced by the label
similarity-incorporated k-NN classifiers (first worksheet) and base k-NN
classifiers (second worksheet) for the test genes extracted from the
Rosetta gene expression data set. The GO terms, arranged in columns,
are sorted from left to right in order of decreasing AUC improvement
obtained by incorporating functional relationships into the base
classifier for the term. The genes in each column are ranked in
descending order of the score assigned by the corresponding k-NN
classifier. Genes with the same score (mostly when the score is 0) are
sorted by their ORF name.
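The ranking rule used within each column of Additional files 4 and 5 (descending score, ties broken by ORF name) can be reproduced with a short Python sketch. The function name and the placeholder gene names and scores below are hypothetical, and the alphabetical direction of the tie-break is assumed to be ascending, since the description does not state it.

```python
def rank_genes(scores):
    """Order genes by descending score, breaking ties by ORF name.

    scores: dict mapping ORF name -> classifier score for one GO term.
    Ascending name order for ties is an assumption of this sketch.
    """
    return sorted(scores, key=lambda orf: (-scores[orf], orf))


# Placeholder names and scores, for illustration only.
column = rank_genes({"orfB": 0.0, "orfA": 0.0, "orfC": 0.7, "orfD": 0.25})
```

Sorting on the tuple (negated score, name) keeps the rule in a single pass and makes the tie-break explicit rather than dependent on input order.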