Some downloads won't work as we are still working on this section. Apologies for the inconvenience.

Download database dumps

Download a standalone PROSPECTR

Replicating PROSPECTR

Modifying PROSPECTR

The ARFF file format

Alternating decision trees are created using the Weka platform. Once a tree has been created, you can use the classifier.pl script to classify genes. Both Weka and the classifier take ARFF files as input. The ARFF format is described here.

For example, the ARFF file used to train PROSPECTR looks something like this:

@relation omim_training_set

@attribute chips numeric
@attribute family_size numeric
@attribute exons numeric
....
@attribute sigpep {0,1}
@attribute rat_upstreams numeric
@attribute chicken_upstreams numeric
@attribute phenotype {disease,normal}

@data

54.1, 15, 6 .... 1, ?, ?, disease
etc.

The @relation tag specifies the name of the dataset and is followed by one or more @attribute tags. The first word after the @attribute tag is the name of the attribute. The second is its type: numeric for continuous variables, or {different,choices,separated,by,commas} for discrete variables. Each @attribute is on a separate line.

The @data tag marks the end of the header. Each line underneath it represents a single instance (in this case, a single gene). The attributes are comma delimited in the order that they appear in the header. Missing values are marked by a single question mark.
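The header and data layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name parse_arff is hypothetical, it assumes one instance per line and unquoted attribute names, and it ignores other parts of the ARFF specification. The second data row in the example is made up for illustration.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, type) pairs in header order
    rows = []        # one list of values per instance
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # A single question mark marks a missing value.
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True  # everything after this line is instance data
    return relation, attributes, rows


example = """@relation omim_training_set

@attribute chips numeric
@attribute sigpep {0,1}
@attribute phenotype {disease,normal}

@data

54.1, 1, disease
12.0, ?, normal
"""

relation, attributes, rows = parse_arff(example)
```

Values are returned as strings in header order, with None standing in for missing values, so each row can be matched against the attribute list by position.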

Converting your own data into ARFF format is probably the hardest aspect of creating your own alternating decision tree. The Cahit Arf utility is able to import data from relational databases directly into ARFF format - otherwise you will have to write your own scripts or create files manually using a text editor.
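If you do end up writing your own conversion script, it might look something like the following sketch. The function name to_arff, the relation name "demo" and the example records are all hypothetical; the sketch assumes your data is already available as one Python dictionary per gene, keyed by attribute name.

```python
def to_arff(relation, attributes, records):
    """Render records (dicts keyed by attribute name) as ARFF text."""
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        lines.append("@attribute {} {}".format(name, kind))
    lines += ["", "@data", ""]
    for record in records:
        values = []
        for name, _ in attributes:
            value = record.get(name)
            # Absent or None values are written as a single question mark.
            values.append("?" if value is None else str(value))
        lines.append(", ".join(values))
    return "\n".join(lines) + "\n"


# Illustrative attributes and records only; the class attribute is listed last.
attributes = [
    ("exons", "numeric"),
    ("sigpep", "{0,1}"),
    ("phenotype", "{disease,normal}"),
]
records = [
    {"exons": 6, "sigpep": 1, "phenotype": "disease"},
    {"exons": 3, "phenotype": "normal"},  # sigpep unknown
]
arff_text = to_arff("demo", attributes, records)
```

Writing the values in the same order as the attribute list keeps the data rows aligned with the header, which is what both Weka and the classifier expect.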

Alternatively, you may use the same ARFF format as PROSPECTR. This may be a good choice if you want to experiment with different combinations of genes (for example, a subset of disease genes) rather than with different features. An ARFF file containing all genes from Ensembl Mart 27.1, with a selection of sequence features as attributes, is available from the authors upon request.

As an example, imagine that you wish to add a new feature, "iep", to those already used by PROSPECTR. This will represent the isoelectric point of the protein. Change the header of your ARFF files by adding the new attribute directly underneath the @relation line.

@relation omim_training_set

@attribute iep numeric
@attribute chips numeric
@attribute family_size numeric
@attribute exons numeric
....
@attribute sigpep {0,1}
@attribute rat_upstreams numeric
@attribute chicken_upstreams numeric
@attribute phenotype {disease,normal}

@data

54.1, 15, 6 .... 1, ?, ?, disease
etc.

Then calculate the isoelectric point for each protein in your training set and insert it, followed by a comma, at the beginning of the corresponding data line.

7.86, 54.1, 15, 6 .... 1, ?, ?, disease

Adding new attributes at the front (i.e. at the beginning of each line instead of the end) is easiest because by default Weka assumes that the last @attribute is the classification of the data.
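For more than a handful of genes, the edit described above is easier to script than to do by hand. The following is a minimal sketch, not part of PROSPECTR: the function name prepend_attribute is hypothetical, and it assumes a simple ARFF file with one instance per line and exactly one new value supplied per instance, in file order.

```python
def prepend_attribute(arff_text, name, kind, values):
    """Insert a new attribute as the first column of a simple ARFF file."""
    out = []
    in_data = False
    added_header = False
    remaining = iter(values)
    for line in arff_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if in_data:
            if stripped and not stripped.startswith("%"):
                # Put the new value at the front of the instance line.
                value = next(remaining)
                line = ("?" if value is None else str(value)) + ", " + stripped
        elif lowered.startswith("@attribute") and not added_header:
            # Declare the new attribute just before the first existing one.
            out.append("@attribute {} {}".format(name, kind))
            added_header = True
        elif lowered.startswith("@data"):
            in_data = True
        out.append(line)
    return "\n".join(out) + "\n"


original = """@relation omim_training_set

@attribute chips numeric
@attribute phenotype {disease,normal}

@data

54.1, disease
"""

updated = prepend_attribute(original, "iep", "numeric", [7.86])
```

Because the new column goes in front, the phenotype attribute stays last and Weka still treats it as the class. If fewer values than instances are supplied, the call to next raises StopIteration rather than silently writing a misaligned file.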

Creating the tree

Once you have created a training set, you can use Weka to create a decision tree.

java weka.classifiers.trees.ADTree -t [name of your training set] -T [name of your training set] -B 15

The -B option sets the number of boosting iterations, and therefore the number of nodes in the tree. See the PROSPECTR paper for an explanation of how node number affects classifier performance.

You can save the tree (the "model") to disk for future use by giving Weka the -d [model filename] option. Once you have saved the tree you can use it with the -l [model filename] option to calculate performance on your test sets like so:

java weka.classifiers.trees.ADTree -l [model filename] -T [name of your test set] -i

The -i option tells Weka to output detailed performance measurements.
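If you train and test many trees, it can help to assemble these command lines in a script. The sketch below only builds the argument lists using the options mentioned above; the function names and file names are hypothetical, and it assumes that java is on your PATH and that Weka is on the Java classpath. Unlike the training command shown earlier, the sketch does not pass -T when training.

```python
import subprocess

ADTREE = "weka.classifiers.trees.ADTree"


def train_command(training_set, model_file, iterations=15):
    """Arguments to build an ADTree from a training set and save it with -d."""
    return ["java", ADTREE,
            "-t", training_set,
            "-B", str(iterations),
            "-d", model_file]


def evaluate_command(model_file, test_set):
    """Arguments to load a saved tree with -l and report detailed results."""
    return ["java", ADTREE,
            "-l", model_file,
            "-T", test_set,
            "-i"]


train = train_command("train.arff", "adtree.model")
evaluate = evaluate_command("adtree.model", "test.arff")

# To actually run Weka, pass a list to subprocess, for example:
# subprocess.run(train, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when file names contain spaces.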

You can also use Weka to output the classifications of a set of genes. See the tutorial and user manual included in the Weka distribution for more details.

© 2004 University of Edinburgh