Third year CSSE projects 2001-2002

Tom Naughton, Room 2.104, tom.naughton@may.ie

Project title: A general-purpose distributed computing environment for DNA analysis
Target: BSc CSSE, year three.
Duration: 3 months.
Number of students required: 2.
Pre-requisites: strong Java programming skills; an interest in networking/operating systems/parallel processing.
Level of difficulty: high

Overview of the project
A programmable distributed computing environment has been built at NUIM as part of a CSSE 4th year project. In this project, two CSSE 3rd year students are required to continue developing the system.

The distributed computing environment consists of a server capable of accepting (i) an algorithm, (ii) a data set, and (iii) instructions on how to partition the data and recombine the results. Client software has also been written that requests a task (in the form of a Java bytecode algorithm plus a block of data) and returns the results. The server combines the results from several clients. The system could be described as a general-purpose "SETI-at-home" system. So far, we have programmed the system to analyse the DNA of one strain of tuberculosis bacteria. The analysis consists of searching the string of 4.4 million characters for repeated substrings. This brute-force search is complicated by the need to accommodate mutations, insertions, and deletions. Client software has been installed on 180 machines in the department and we have already completed many Pentium-years of processing in a few short weeks. The clients communicate using TCP/IP, so we could in principle extend the system to other computers in the University and beyond.
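As an illustration of the partition/search/recombine cycle described above, the following is a minimal sketch and not the code of the existing system. It assumes that every client can hold the full sequence in memory and that the work is split by window start position; the existing system may partition the data differently. Only mutations are handled, by counting mismatching characters (Hamming distance); insertions and deletions would require an edit-distance comparison instead. The class and method names (RepeatSearchSketch, partition, countRepeats) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the existing system. */
final class RepeatSearchSketch {

    /** Splits the window start positions 0..positions-1 into half-open ranges {from, to}. */
    static List<int[]> partition(int positions, int blockSize) {
        List<int[]> ranges = new ArrayList<>();
        for (int from = 0; from < positions; from += blockSize) {
            ranges.add(new int[] {from, Math.min(positions, from + blockSize)});
        }
        return ranges;
    }

    /** True if the windows of length len starting at i and j differ in at most maxMismatches places. */
    static boolean similar(String seq, int i, int j, int len, int maxMismatches) {
        int mismatches = 0;
        for (int k = 0; k < len; k++) {
            if (seq.charAt(i + k) != seq.charAt(j + k) && ++mismatches > maxMismatches) {
                return false; // stop as soon as the mismatch budget is exceeded
            }
        }
        return true;
    }

    /**
     * Work for one client: brute-force count of pairs (i, j) with from <= i < to and
     * j >= i + len (non-overlapping windows) whose windows are similar.
     */
    static long countRepeats(String seq, int from, int to, int len, int maxMismatches) {
        long count = 0;
        for (int i = from; i < to && i + len <= seq.length(); i++) {
            for (int j = i + len; j + len <= seq.length(); j++) {
                if (similar(seq, i, j, len, maxMismatches)) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Recombination on the server: every pair is counted by exactly one range, so the partial counts simply add up. */
    static long combine(List<Long> partialCounts) {
        long total = 0;
        for (long c : partialCounts) {
            total += c;
        }
        return total;
    }
}
```

Because each pair is attributed to the range containing its first window, the ranges can be processed in any order and on any client. Ranges near the start of the sequence involve more comparisons than ranges near the end, so a real partitioning strategy would probably use unequal range sizes.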

Much work remains to be done. We have had success running the client as a low-priority service in Windows NT, but this required the use of a third-party application. We would like to implement this functionality ourselves and, in addition, run the clients as low-priority tasks under Solaris. Efficient string manipulation algorithms are also required, as are special-purpose data partitioning strategies. For the server, a graphical interface is needed. Administration tools are also required, for example to monitor the progress of individual machines and to increase or decrease the data block size. It would be desirable, though not absolutely necessary, to permit physical separation of the interface and the server so that the server could be administered remotely. Advanced improvements include incorporating job queueing/job scheduling capabilities into the server, coping with client failure, and developing a user-friendly client-software installation package.
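One possible approach to job queueing and client failure, given here only as a sketch, is to lease each data block to a client for a fixed time and return the block to the queue if no result arrives before the lease expires. The class BlockQueueSketch and its methods are hypothetical and are not part of the existing server; the current time is passed in as a parameter purely to keep the example easy to test.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch of a server-side block queue with lease timeouts. */
final class BlockQueueSketch {
    private final Queue<Integer> pending = new ArrayDeque<>();      // blocks waiting for a client
    private final Map<Integer, Long> deadlines = new HashMap<>();   // leased block id -> lease expiry (ms)
    private final long leaseMillis;

    BlockQueueSketch(int blockCount, long leaseMillis) {
        for (int id = 0; id < blockCount; id++) {
            pending.add(id);
        }
        this.leaseMillis = leaseMillis;
    }

    /** Leases the next block to a client, or returns -1 if no block is currently pending. */
    synchronized int request(long nowMillis) {
        requeueExpired(nowMillis);
        Integer id = pending.poll();
        if (id == null) {
            return -1;
        }
        deadlines.put(id, nowMillis + leaseMillis);
        return id;
    }

    /**
     * Marks a block as finished. A late result for a block that was already re-queued
     * is still accepted; a duplicate report for a finished block returns false.
     */
    synchronized boolean complete(int blockId) {
        return deadlines.remove(blockId) != null || pending.remove(Integer.valueOf(blockId));
    }

    /** True once every block has been completed. */
    synchronized boolean allDone() {
        return pending.isEmpty() && deadlines.isEmpty();
    }

    /** Moves blocks whose lease has expired back to the pending queue. */
    private void requeueExpired(long nowMillis) {
        Iterator<Map.Entry<Integer, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> lease = it.next();
            if (lease.getValue() <= nowMillis) {
                pending.add(lease.getKey());
                it.remove();
            }
        }
    }
}
```

The lease length is a tuning parameter: too short and slow machines have their blocks reissued needlessly, too long and a failed client delays completion. The same bookkeeping could also feed an administration tool that monitors the progress of individual machines.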

To a certain extent, the project can be tailored to an applicant's interests and abilities. However, the applicant must have strong Java programming skills (in order to be able to understand the current code and modify it effectively). In addition, the applicant must have the confidence to work on his/her own. Personal supervision will only be possible for approximately half of the three-month placement period, with email contact at all other times. A 4th year thesis on this project will be available for consultation (including documentation for the current system) as will limited system administration help and advice from the technical staff at the Department. Successful students may have the opportunity to publish their findings at an international conference in June 2002. An electronic version of this page is located at http://www.cs.may.ie/~tnaughton/typ2001-2002.html.

Relevant literature
Berkeley's SETI@home webpage and similar distributed computing projects
The Cilk project at MIT
IBM's Distributed Computing Environment
NASA's Beowulf Project
A report on Distributed Cluster Computing Environments including links to other DCE environments
Sun article on RMI and Java distributed computing

Bioinformatics.org website
Journal of Bioinformatics, Oxford University Press (Free online access)
Introduction to Bioinformatics by Arthur M. Lesk, Oxford University Press, 2002
Developing Bioinformatics Computer Skills by Cynthia Gibas and Per Jambeck, O'Reilly, 2001

Deliverables
Deliverable for student 1. Client software installed as a low-priority service under NT and Solaris. Algorithms for DNA analysis implemented, including efficient schemes for partitioning the data.
Deliverable for student 2. Creation of a (possibly remote) interface for the server. Development of administration tools for the system.

Non-standard software or hardware required for the project
None.

