Agent-Based Systems Group : Research : GeneWeaver

The GeneWeaver project involves the development of a flexible system for automatic genome analysis and annotation.

The project was funded by the BBSRC/EPSRC Bioinformatics Initiative from April 1998 to March 2001.

Genome analysis involves a number of stages, including:

  • Assembly of contigs generated by sequencing machines.
  • Detection of open reading frames (ORFs) in the assembled genome.
  • Assignment of functional descriptions to the proteins.
  • Assignment of structural features to the proteins.
  • Detection of regulatory units such as promoters, enhancers and silencers.
  • Construction of metabolic pathways for the organism by considering the different gene products.

Other groups have already written successful bioinformatics software to perform the analysis required for a number of these steps. GeneWeaver provides an architecture that integrates these applications into a single system, which can analyse genomes automatically and manage the generated data efficiently.

Developing a large system that integrates heterogeneous components requires a set of guiding principles. GeneWeaver follows these:

  • It should be easy to integrate new analysis methods. It should also be easy to extend the data types handled by the system since new methods and new functionality may require additional data types.
  • It should be possible to distribute both tasks and data across a network. This brings two benefits: computationally intensive tasks can be load-balanced across several machines, and tasks that require specialized hardware, such as large backup storage, can be run on appropriate machines.
  • Primary data sources, such as sequence databases, should be constantly monitored, and any changes to the data should be automatically incorporated. In other words, the system should watch its environment and react to the changes that occur.
  • All data in the system should have dates of last modification and last update. Dependencies of some data on other data also need to be tracked at a fine level of granularity. This will allow the system to update all relevant information automatically when a particular item of data changes, without reanalyzing the complete genome.
  • All data in the system should have a degree (or may have many degrees) of confidence associated with it. It has been pointed out (Karp, 1998) that a key deficiency of current sequence databases is the lack of a reliability score attached to the functional annotation. This results in further annotation being based on annotation which may be unreliable.
  • The system should consist of loosely-coupled modules which can be easily combined to give additional functionality. The software interface to these modules should be open so that third-party modules can be incorporated.
  • All data should contain histories of how it was derived from other data. This provides an audit trail which a manual annotator can use to determine the likely accuracy of any unusual cases.
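
The data-related principles above (modification dates, fine-grained dependencies, confidence scores and derivation histories) can be illustrated with a small sketch. This is not GeneWeaver's actual data model: the names DataItem, DataStore, update and history are hypothetical, and the sequences, methods and confidence values are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class DataItem:
    """One item of data held by the system (hypothetical structure)."""
    item_id: str
    value: str
    confidence: float                                        # reliability score in [0, 1]
    derived_from: List[str] = field(default_factory=list)   # ids of the source items
    method: str = "primary"                                  # how the value was produced
    last_modified: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    stale: bool = False                                      # True once a source has changed


class DataStore:
    """Keeps items plus a reverse index from each source to its dependents."""

    def __init__(self) -> None:
        self.items: Dict[str, DataItem] = {}
        self.dependents: Dict[str, Set[str]] = {}

    def add(self, item: DataItem) -> None:
        self.items[item.item_id] = item
        for src in item.derived_from:
            self.dependents.setdefault(src, set()).add(item.item_id)

    def update(self, item_id: str, value: str) -> List[str]:
        """Change one item and flag everything derived from it as stale.

        Returns the sorted ids that need re-analysis; unrelated items
        are left untouched.
        """
        item = self.items[item_id]
        item.value = value
        item.last_modified = datetime.now(timezone.utc)
        item.stale = False
        affected: Set[str] = set()
        queue = [item_id]
        while queue:
            current = queue.pop()
            for dep in self.dependents.get(current, ()):
                if dep not in affected:
                    affected.add(dep)
                    self.items[dep].stale = True
                    queue.append(dep)
        return sorted(affected)

    def history(self, item_id: str) -> List[str]:
        """Audit trail: walk back through the derivations of an item."""
        item = self.items[item_id]
        lines = [f"{item.item_id} <- {item.method} "
                 f"(confidence {item.confidence:.2f})"]
        for src in item.derived_from:
            lines.extend(self.history(src))
        return lines


# Illustrative data only: two sequences, each with its own derived items.
store = DataStore()
store.add(DataItem("seq1", "ATGAAATAA", 1.0))
store.add(DataItem("orf1", "seq1:1-9", 0.9, ["seq1"], "orf-finder"))
store.add(DataItem("func1", "putative kinase", 0.4, ["orf1"], "homology-search"))
store.add(DataItem("seq2", "ATGCCCTAA", 1.0))
store.add(DataItem("orf2", "seq2:1-9", 0.9, ["seq2"], "orf-finder"))

to_redo = store.update("seq1", "ATGAAGTAA")   # only items derived from seq1 are flagged
trail = store.history("func1")                # func1 -> orf1 -> seq1
```

In this sketch a change to seq1 flags only orf1 and func1 for re-analysis, while orf2 is left alone, and the history of func1 exposes the low-confidence step that a manual annotator might want to inspect.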

These requirements are naturally satisfied by the model of multiple interacting software agents. Software agents are becoming increasingly popular, with a correspondingly large number of available texts (e.g. Bradshaw, 1997; Knapik and Johnson, 1998) and a range of different agent architectures. GeneWeaver is based on a multi-agent system in which each agent takes on a particular responsibility or area of expertise. For example, an agent may be responsible for keeping a non-redundant database updated, managing a genome and its data, or performing homology searches (using whatever methods the agent chooses as appropriate to a particular situation). The agents coordinate their activities by sending messages to each other to accomplish overall tasks. Each agent can be viewed as an individual program with the following properties:

  • Persistence: Agents run continuously so that the system can react to changes in the data.
  • Reactivity: Agents can respond to a changing environment. For example, if an external primary database changes, the agents can react to it.
  • Autonomy: Agents are able to function without human intervention. This also makes the system robust: because every agent is autonomous, no agent can rely on another agent completing a task successfully, so all agents need to be designed to cope with failure in others.
  • Pro-activeness: Agents behave in a goal-directed fashion. An agent may be told to determine a homologue for a particular protein, but it will not be explicitly told to run a particular method such as PSI-BLAST. Expertise in particular tasks is encapsulated in particular agents, which simplifies system development.
  • Social ability: Agents interact with other agents by communicating in a high-level language.
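
As a rough illustration of goal-directed messaging between agents, the sketch below has a genome agent ask a homology agent for a homologue without naming a method. It is a minimal, synchronous toy under stated assumptions, not GeneWeaver's implementation: Message, MessageBus, HomologyAgent and GenomeAgent are hypothetical names, and the two search functions are trivial stand-ins for real tools such as PSI-BLAST. Persistence (agents running continuously) is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Message:
    sender: str
    recipient: str
    goal: str       # what is wanted, not how to achieve it
    content: str


class Agent:
    def __init__(self, name: str) -> None:
        self.name = name

    def handle(self, msg: Message) -> Optional[str]:
        return None   # by default an agent ignores messages


class MessageBus:
    """Delivers messages by agent name (hypothetical transport)."""

    def __init__(self) -> None:
        self.agents: Dict[str, Agent] = {}

    def register(self, agent: Agent) -> None:
        self.agents[agent.name] = agent

    def send(self, msg: Message) -> Optional[str]:
        agent = self.agents.get(msg.recipient)
        if agent is None:
            return None               # recipient is not available
        try:
            return agent.handle(msg)
        except Exception:
            return None               # senders must cope with failure in others


SearchMethod = Callable[[str], Optional[str]]


class HomologyAgent(Agent):
    """Owns the search expertise and decides which method to run."""

    def __init__(self, name: str, methods: Dict[str, SearchMethod]) -> None:
        super().__init__(name)
        self.methods = methods        # tried in insertion order

    def handle(self, msg: Message) -> Optional[str]:
        if msg.goal != "find_homologue":
            return None
        for method_name, method in self.methods.items():
            hit = method(msg.content)
            if hit is not None:
                return f"{hit} (via {method_name})"
        return None


class GenomeAgent(Agent):
    """Manages annotations and delegates homology questions."""

    def __init__(self, name: str, bus: MessageBus) -> None:
        super().__init__(name)
        self.bus = bus
        self.annotations: Dict[str, str] = {}

    def annotate(self, protein_id: str, sequence: str) -> str:
        reply = self.bus.send(
            Message(self.name, "homology", "find_homologue", sequence))
        # Fall back gracefully if the other agent failed or found nothing.
        self.annotations[protein_id] = reply if reply is not None else "unannotated"
        return self.annotations[protein_id]


# Toy stand-ins for real search tools; the data is invented.
KNOWN = {"MKT": "kinase-like protein"}


def exact_match(seq: str) -> Optional[str]:
    return KNOWN.get(seq)


def prefix_match(seq: str) -> Optional[str]:
    for known_seq, description in KNOWN.items():
        if seq.startswith(known_seq):
            return description
    return None


bus = MessageBus()
bus.register(HomologyAgent("homology", {"exact": exact_match, "prefix": prefix_match}))
genome = GenomeAgent("genome", bus)

first = genome.annotate("p1", "MKT")       # found by the first method
second = genome.annotate("p2", "MKTAAA")   # falls through to the second method
third = genome.annotate("p3", "GGG")       # no method finds a hit
```

The genome agent only states the goal; swapping or adding search methods changes the homology agent alone, which is the kind of encapsulation the pro-activeness property describes.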

A multi-agent system should therefore provide a flexible, open architecture for annotating genomes and for keeping those annotations as up to date as possible.

Bradshaw, J.M. (ed.) (1997) Software Agents. American Association for Artificial Intelligence, Menlo Park, California.

Karp, P.D. (1998) What we do not know about sequence analysis and sequence databases. Bioinformatics, 14, 753-754.

Knapik, M. and Johnson, J. (1998) Developing Intelligent Agents for Distributed Systems. McGraw-Hill, New York.