GeneWeaver

"The Development of Intelligent Software Agents for Genome Analysis and Protein Structure Prediction"

Introduction

One of the most important and pressing challenges faced by present-day biological scientists is to move beyond the task of genomic data collection in the sequencing of DNA, and to make sense of that data so that it may be used, for example, in the development of therapies to address critical genetic disorders.

The process of identifying genes and predicting the structure of the encoded proteins involves computer-based tasks, including:

  • scanning sequence databases for similar sequences,
  • collecting the matching sequences,
  • constructing alignments of the sequences, and
  • inferring the function of the sequence from annotations of the matched proteins (for which the function is already known).

Predicting the three-dimensional structure of the proteins requires analyses of the collected sequence data by a range of different programs, which sometimes agree but often do not (though they typically provide confidence scores that enable relatively easy interpretation).

Many tools are available to perform these tasks, but they are typically standalone programs that are not integrated with each other. Expert users perform each stage manually and combine the results in appropriate ways. For example, a search for a matching sequence might find an annotated gene, but the annotations include much spurious information alongside the important functional information. The problem is to distil the relevant information, which is not difficult for an expert but may be problematic for a less experienced user. With the amount of data that is being generated, this kind of expertise is critical.

Both the primary data and some of the programs are accessible only over the Internet — by either electronic mail or the WWW (and increasingly the latter). This requires the different sources of information and the different programs to be managed effectively, and further complicates the efficient processing of the genomic data.

The raw data has been accumulating at an unprecedented pace, and a range of computational tools and techniques have been developed by bioinformaticians, targeted at the problems of storing and analysing that data. In this sense, much has already been achieved, but these tools are labour-intensive and usually require expert manual direction and control, imposing huge restrictions on the rate of progress. Essentially, however, the problems involved are familiar from other domains (vast amounts of data and information, existing programs and databases, complex interactions, distributed control), pointing strongly to the adoption of a multi-agent approach.

Project Aims

An agent-based architecture consists of a number of distributed and autonomous software programs known as agents. These interact using a standard communication language that allows them to cooperate with the aim of accomplishing their overall goals. Each agent can 'wrap' heterogeneous data and methods, presenting them to the community of agents in a uniform manner by way of the common agent-communication language. Potentially, agents provide a framework within which distributed and autonomous resources can be managed and integrated. The principal objective of this project was the application of agent-based concepts to the management and integration of automatic genome analysis and protein structure prediction.

A large number of resources, both data and methods, are freely available over the Internet and potentially can be used for bioinformatics tasks such as genome annotation. Unfortunately the actual use of these data and methods, particularly for automatic systems, is hampered by several factors:

  • Data and methods are distributed across the network and are under the control of third parties who need to be able to modify them as they wish. Any system wishing to integrate and use these resources must take into account both the constant changes in the data and the fact that the resources are autonomous and unpredictable (a resource may be terminated, for instance).
  • The data is largely heterogeneous both in terms of data formats and the terminologies (or ontologies) employed. This applies both to primary data stored in various databases and to the data used for input and output by different methods.
  • Access to the various distributed methods using web servers has greatly increased the availability of methods to human users, but they are not entirely satisfactory for a variety of reasons. For example, direct communication between two servers is problematic (and is a common requirement when automatically integrating methods), due to web servers' reliance on natural language interpretation. Furthermore, web servers 'hide' important data, such as the version of any underlying databases employed, which is essential information when considering automatic updating and permits the resolution of inconsistencies which arise when employing a number of servers.

Some of these factors are very wide-ranging and complex, for instance the development of consistent ontologies for use in bioinformatics. These require a community-wide approach such as the Gene Ontology initiative. Our original proposal did not aim to cover such aspects but limited itself to the application of agent-based techniques to facilitate data and method management for the application fields mentioned above.

The GeneWeaver Architecture

At the start of the project, a number of agent architectures already existed, and we examined the possibility of re-using one of them. None of them was suitable for our requirements, and we thus decided to undertake the development of a specific architecture for the bioinformatics domain, which we named GeneWeaver.

GeneWeaver is a multi-agent system aimed at addressing many of the problems in the domain of genome analysis and protein structure prediction. It comprises a community of agents that interact with each other, each performing some distinct task, in an effort to automate the processes involved in, for example, determining gene function. Agents in the system can be concerned with management of the primary databases, performing sequence analyses using existing tools, or with storing and presenting resulting information. The important point to note is that the system does not offer new methods for performing these tasks, but organises existing ones for the most effective and flexible operation.

Adoption of a suitable agent-based language was seen to be crucial since it acts as a common language between all the agents in the system. We began with an established agent-based language (KQML) which we modified to form the BioAgent Language (BAL). Such a language is the only thing which 'couples' agents together by allowing one agent to influence another agent towards a particular goal, and the agents are otherwise completely autonomous in nature.
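As a rough illustration of what a message in a KQML-derived language might look like, the sketch below renders a message in KQML's parenthesised keyword syntax. The performative name and the :sender, :receiver, :language and :content parameters come from KQML; the agent names, the content language name and the entry identifier are hypothetical, and the actual syntax of BAL is not described here, so this should not be read as BAL itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    performative: str   # KQML performative, e.g. "ask-one", "tell", "subscribe"
    sender: str         # hypothetical agent name
    receiver: str       # hypothetical agent name
    language: str       # language in which `content` is expressed
    content: str        # the request or data itself

    def to_kqml(self) -> str:
        """Render the message in KQML's parenthesised keyword syntax."""
        return (f"({self.performative} :sender {self.sender} "
                f":receiver {self.receiver} :language {self.language} "
                f':content "{self.content}")')


# A hypothetical genome agent asking a database agent for one entry.
msg = Message("ask-one", "genome-agent", "swissprot-agent",
              "simple-query", "sequence SEQ0001")
print(msg.to_kqml())
```

Keeping the content language as an explicit field is what lets the outer message format stay fixed while the payload format varies between pairs of agents.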

One of the primary aspects of the design of GeneWeaver was to make a single agent responsible for both the provision and management of each resource. The prototype system includes three primary database agents (SWISSPROT, PIR and PDB) which provide a number of data services to other agents. The current data services include simple querying of the data and allowing agents to 'subscribe' to data. BAL has been designed to allow a variety of different data exchange and querying languages to be employed; the only requirement is that the two agents involved in an interaction use a language common to both of them. This permits easy future extension as standards for the exchange of biological data, for instance XML, emerge.
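The query and subscribe services, and the choice of a content language shared by two agents, can be sketched as follows. This is a minimal in-process toy under stated assumptions, not the GeneWeaver implementation: the class, the function, the language names and the entry identifier are all hypothetical, and real agents would exchange messages over a network rather than call each other directly.

```python
from typing import Callable, Dict, List, Optional


def common_language(offered: List[str], understood: List[str]) -> Optional[str]:
    """Return the first offered content language the other agent also understands."""
    for lang in offered:
        if lang in understood:
            return lang
    return None


class DatabaseAgent:
    """Toy stand-in for a primary database agent."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._entries: Dict[str, str] = {}                        # entry id -> sequence
        self._subscribers: List[Callable[[str, str], None]] = []  # notification callbacks

    def query(self, entry_id: str) -> Optional[str]:
        """Simple query service: look up one entry by identifier."""
        return self._entries.get(entry_id)

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        """Subscription service: register interest in future changes."""
        self._subscribers.append(callback)

    def update(self, entry_id: str, sequence: str) -> None:
        """Store a new or changed entry; notify subscribers only on a real change."""
        if self._entries.get(entry_id) == sequence:
            return
        self._entries[entry_id] = sequence
        for notify in self._subscribers:
            notify(entry_id, sequence)


received: List[str] = []
db = DatabaseAgent("toy-db")
db.subscribe(lambda entry_id, seq: received.append(entry_id))
db.update("SEQ0001", "MKTAYIAK")
db.update("SEQ0001", "MKTAYIAK")   # unchanged, so no second notification
```

Notifying only on real changes keeps subscribers from repeating work when a refresh brings no new data.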

A second important feature of GeneWeaver is that the database agents automatically update their data (currently using FTP sites) and inform any subscribed agents of relevant changes. The prototype system employs a non-redundant database agent that provides data services similar to those of the primary database agents but updates its data by subscribing to sequences managed by the primary database agents. Two calculation agents (PSI-BLAST and MEMSAT) are included, which register meta-data about what particular goals they can achieve (for instance 'can derive membrane topology for protein sequence') together with more general data on their methods' accuracy and speed. This allows the calculation agents to be used by other agents in two ways: either directly, by commanding the agent to carry out a particular method, or by giving a general goal such as 'derive X'.
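The two ways of using a calculation agent can be sketched with a small capability registry. This is an illustrative sketch only: the registry class, the selection rule (highest self-reported accuracy, then speed), the agent names and the accuracy and speed figures are all assumptions, and the lambda functions are placeholders rather than wrappers around PSI-BLAST or MEMSAT.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Capability:
    agent: str                      # name of the calculation agent
    goal: str                       # what it "can derive", e.g. "membrane-topology"
    accuracy: float                 # self-reported, 0..1 (made-up numbers below)
    seconds_per_job: float          # self-reported speed
    method: Callable[[str], str]    # placeholder for the wrapped program


class Registry:
    def __init__(self) -> None:
        self._caps: List[Capability] = []

    def register(self, cap: Capability) -> None:
        self._caps.append(cap)

    def command(self, agent: str, goal: str, sequence: str) -> str:
        """Direct use: ask a named agent to run its method for a goal."""
        for cap in self._caps:
            if cap.agent == agent and cap.goal == goal:
                return cap.method(sequence)
        raise LookupError(f"{agent} does not offer {goal}")

    def derive(self, goal: str, sequence: str) -> str:
        """Goal-based use: pick the most accurate provider, fastest on ties."""
        candidates = [c for c in self._caps if c.goal == goal]
        if not candidates:
            raise LookupError(f"no agent can derive {goal}")
        best = max(candidates, key=lambda c: (c.accuracy, -c.seconds_per_job))
        return best.method(sequence)


registry = Registry()
registry.register(Capability("fast-predictor", "membrane-topology", 0.70, 1.0,
                             lambda seq: f"fast:{len(seq)}"))
registry.register(Capability("careful-predictor", "membrane-topology", 0.90, 30.0,
                             lambda seq: f"careful:{len(seq)}"))

by_goal = registry.derive("membrane-topology", "MKTAYIAK")
direct = registry.command("fast-predictor", "membrane-topology", "MKTAYIAK")
```

Because requesters name only the goal, a newly registered agent can be selected without the requester knowing anything else about it.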

Calculation agents manage their own methods, so the PSI-BLAST agent updates the underlying databases employed on a regular basis using the other database agents, and the MEMSAT agent re-trains itself using new membrane proteins derived from the SWISSPROT agent. Essentially the agents use services provided by other agents in the community to improve their own services. This can be viewed as a rather specialised form of learning. These automatic mechanisms also give the system a novel level of data and method consistency.

The system is open, and new calculation agents may join the community at any time. Even when other agents in the community do not know the exact nature of the new agents, their services may be employed since they are described in general terms of 'can derive X'.

GeneWeaver contains a number of genome agents for simple bacterial genomes. These use FTP to maintain up-to-date copies of their data and use the calculation agents to annotate their data.

The GeneWeaver system is based on a uniform agent model, with a large degree of common code shared between the agents. The differences in behaviour between agents result from the initial loading of different components, such as 'skills', which perform particular actions, and 'motivations', which drive the agent to follow particular goals. The uniform design structure adopted for all the agents should greatly facilitate future expansion of the agent community since new types of agents may be implemented with only small additional amounts of code.
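One way to picture the uniform agent model is a single agent class whose behaviour is determined entirely by the skills and motivations loaded into it. The sketch below is a guess at the general shape, not the GeneWeaver design: the control rule (act on the strongest motivation), the numeric strengths and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Skill = Callable[[dict], None]          # an action that changes the agent's state


class Motivation:
    def __init__(self, name: str, strength: Callable[[dict], float], skill: str) -> None:
        self.name = name
        self.strength = strength        # how pressing the goal is in a given state
        self.skill = skill              # name of the skill that serves this goal


class Agent:
    """Common code shared by all agents; behaviour comes from loaded components."""

    def __init__(self, name: str, skills: Dict[str, Skill],
                 motivations: List[Motivation], state: dict) -> None:
        self.name = name
        self.skills = skills
        self.motivations = motivations
        self.state = state

    def step(self) -> Optional[str]:
        """Run the skill of the strongest motivation; return its name, or None if idle."""
        best = max(self.motivations, key=lambda m: m.strength(self.state))
        if best.strength(self.state) <= 0:
            return None
        self.skills[best.skill](self.state)
        return best.skill


def refresh(state: dict) -> None:
    state["stale"] = False              # stands in for fetching fresh data


def annotate(state: dict) -> None:
    state["annotated"] += state["pending"]
    state["pending"] = 0


agent = Agent(
    "toy-genome-agent",
    skills={"refresh": refresh, "annotate": annotate},
    motivations=[
        Motivation("stay-current", lambda s: 1.0 if s["stale"] else 0.0, "refresh"),
        Motivation("annotate-data", lambda s: 0.5 if s["pending"] else 0.0, "annotate"),
    ],
    state={"stale": True, "pending": 3, "annotated": 0},
)
trace = [agent.step(), agent.step(), agent.step()]
```

Under this reading, a new type of agent needs only a new dictionary of skills and a new list of motivations; the Agent class itself is unchanged.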

Results

The prototype system has demonstrated the feasibility of this novel approach and has revealed a number of benefits, some not envisaged in the original proposal. It succeeds in providing a limited degree of genome annotation (protein membrane classification and homology) in which the methods can re-train themselves as newly discovered data becomes available. It enables the integration of distributed databases and methods while permitting them to remain under the control of third parties, since the system assumes all agents are autonomous. It should be noted that this architecture seems particularly appropriate for the recently established ideas of GRID-based computing infrastructures.

It is clear that an architecture such as GeneWeaver requires considerable investment in design and development. The result is a substantially more complex system, mainly because of the inherent decentralised control, but one that offers greater flexibility. One of the consequences of this feature of multi-agent systems, recognised within the agent community, is the need for the wider adoption and development of standards. For example, the FIPA standard for agent-based communication, together with FIPA-OS as an open-source agent framework, potentially permits much more rapid development of agent systems for bioinformatics in the future (although this remains untested).

In the original proposal, we envisaged a working system for genome analysis and protein structure prediction. We have developed such a prototype system that demonstrates the principal concepts and benefits. Further work on this system is still ongoing at INRA in France, field-testing and extending the prototype system so that it may be used to carry out a first-pass annotation of a number of novel Lactococcus genomes sequenced at INRA.

Whether the encapsulation of databases should be agent-oriented is a moot point. Certainly, related work in database integration is relevant, and the database community is addressing these issues on a somewhat different tack. Nevertheless, the agent approach offers an over-arching design paradigm, and offers much more in those areas where databases are not relevant, particularly for calculation agents for tool support.

Conclusion

The GeneWeaver system has provided a demonstration of the suitability of the agent approach in bioinformatics to provide solutions to problems with dramatically large amounts of data and a vast array of tools to be encapsulated. While only a limited range of tools has been included in the prototype system, the success of the architectural design points to its effectiveness in larger scale systems. Indeed the underlying principles are the subject of further work at INRA, the focus of an EPSRC E-Science proposal, and the topic of a workshop in 2002.

Future work might aim to extend the range of calculation agents and then to assess the entire system in relation to the activity of human domain experts. Further refinements both to the architecture and to the individual agent control mechanisms could then be investigated.