| LRRP | A Proposal to develop Data Mining at CADC |
| David Schade | National Research Council |
This growth in archive facilities and the quantity of data that they contain will become more and more dramatic in the next decade due to new survey projects (the Canadian Galactic Plane Survey, the Sloan Digital Sky Survey, and the CFHT MEGACAM survey are only a few examples) and due to the growing awareness that archiving is an important component of the output of both ground-based (Gemini, VLT) and space-based (HST, NGST, FUSE) observatories.
The Canadian Astronomy Data Centre sees great opportunities ahead. We will design, and we are bidding to host, the Gemini archive and we intend to play an important role in the NGST archive effort as well as the archives of other space missions with Canadian involvement, for example, FUSE and CUVIT.
In this document we outline a vision of the data archiving facilities and services that Canadian astronomers will need during the next 5-10 years in order to remain competitive and to excel in their selected areas. We also outline a proposal for development of the CADC into the facility that will meet those needs in a cost-effective manner.
We believe that innovative archiving, together with the development of advanced data mining tools to exploit those archives, is a key strategy that smaller members of major international facility partnerships (a role that Canada often plays) can use to ensure that we get the maximum scientific opportunity in exchange for our investments.
The CADC (one of three sites hosting complete sets of HST data and soon to host the Canadian Galactic Plane Survey) has a record of successful international collaborations (ST-ECF, ESO, STScI, RGO, AAO, CDS Strasbourg) which are important as we work towards the development of effective cross-archive services. CADC continuously monitors developments in commercial software that might provide effective off-the-shelf solutions to some aspects of our archiving problems. We are interested in providing a scientifically powerful and cost-effective product to our users and have no interest in re-inventing anything. We maintain close links with university astronomers (we currently have funding for a joint UVIC-CADC position in Java tool development as part of the Canadian Computing Collaboratory) and other NRC institutes (in particular the Institute for Information Technology). We are exploring affordable parallel supercomputing (the BEOWULF project) in collaboration with the Dominion Radio Astrophysical Observatory. In accord with NRC policies, CADC is pursuing opportunities to commercialize components of the CADC environment in order to generate revenue to help support our activities.
There are large survey projects looming on the horizon such as the Sloan Digital Sky Survey, MEGACAM Surveys, and the 2MASS infrared sky survey. Each project represents an increase of one to several orders of magnitude in data storage requirements, with concomitant increases in the capacity needed for data processing, cataloguing, and distribution. This new generation of observational projects represents unprecedented scientific opportunities, and advanced archive facilities are a fundamental part of their effective exploitation. Canada need not host all of these archives on our territory, but we need the ability to access and analyse effectively all public archives regardless of location. Existing resources at CADC are not sufficient to ensure that Canadian astronomers have the access and capabilities needed to exploit these new opportunities.
There will also be a boom in multi-wavelength astronomy as the archives are populated with well-calibrated datasets. The tyranny of the energy band that restricts astronomers to working in their ``native wavelength regime'' will be broken as the requirement for a deep technical understanding of every instrument that one uses is alleviated. This is all part of the gradual shift---induced by space missions, queue-mode observing, and data archiving---of the technical burden away from the individual scientist, thus freeing that scientist to focus on physics rather than technical considerations.
In summary, we see a quiet revolution occurring among graduate students and young astronomers who are not tied to conventional proposal/observing/reduction/analysis routines but are eager to operate in whatever mode is the most scientifically effective. This will result in the best possible science.
The CADC response to this rapidly changing environment is the present proposal to develop Data Mining capabilities. In the most general sense, Data Mining is defined as the extraction of knowledge from very large databases. In the astronomical case there are two distinct phases to the process. The first is the creation of a Data Warehouse and the second is the exploitation of the contents of that warehouse using whatever analysis tools are most effective.
The Data Warehouse phase is equivalent to the creation of a ``good'' archive, that is, one where the data are fully characterised and calibrated, catalogued, searchable, and retrievable. This is a problem that we understand. HST is a useful observatory model for the creation of a good archive. Our CFHT experience has shown us many of the pitfalls in implementing an archive for a ground-based observatory. Our role as a Data Centre will remain as the core of our activities although we realize that we need to develop intelligent query and retrieval tools that operate effectively across archive boundaries.
For the second phase of the Data Mining problem we need to develop the capability and the tools which will allow post-processing and analysis of the results returned by user queries, beginning with a few highly-specialised functionalities and gradually building a generalised library of data mining tools. These tools need to operate both on tables of information (e.g., cross-identifications of entries in multiple catalogues) and directly on pixel data (for example, identification and retrieval of a large number of corresponding X-ray, optical, and infrared images, reduction to a common scale, and the creation of colour maps). More ambitious programs might re-process an entire imaging archive through a new user-supplied algorithm.
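As an illustration of the table-level tools described above, the following is a minimal sketch of positional cross-identification between two catalogues. It is not a CADC or TERAPIX implementation: the function names, the `(source_id, ra_deg, dec_deg)` tuple layout, and the 1 arcsecond default tolerance are all assumptions made for the example, and the brute-force search would need a spatial index to scale to survey-sized catalogues.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp against rounding error before taking the arcsine.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cross_identify(cat_a, cat_b, max_sep_arcsec=1.0):
    """Match each source in cat_a to its nearest cat_b source within a tolerance.

    Each catalogue is a list of (source_id, ra_deg, dec_deg) tuples
    (a hypothetical layout chosen for this sketch).
    Returns a list of (id_a, id_b, separation_arcsec).
    Brute force, O(len(cat_a) * len(cat_b)).
    """
    max_sep_deg = max_sep_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        best_id, best_sep = None, max_sep_deg
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        if best_id is not None:
            matches.append((id_a, best_id, best_sep * 3600.0))
    return matches


# Illustrative (made-up) coordinates: one optical source has an X-ray counterpart.
optical = [("opt-1", 150.0, 2.0), ("opt-2", 150.1, 2.1)]
xray = [("xray-7", 150.0001, 2.0), ("xray-9", 151.0, 3.0)]
matches = cross_identify(optical, xray)
```

A pure positional match of this kind ignores positional uncertainties and source density; a production tool would likely weight candidates by those quantities.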
We are proposing a four year development project with a number of specific goals (outlined below). But it is important to note that it will take at least the entire next decade to implement a sophisticated, general, and flexible Data Mining environment which provides a large fraction of the functionality that astronomers are capable of imagining. The present proposal begins that process for the TERAPIX dataset and will deliver prototypes of tools that will be more general. CADC will need to identify other resources in 2003 to sustain the necessary effort.
TERAPIX (http://terapix.iap.fr) is effectively a project to create a data warehouse and this represents an ideal test-bed for development of high-volume, automated pipeline processing. Pipeline processing of astronomical data is the essence of the first phase of data mining. We understand very clearly that data that are not reliably calibrated have very little value for research purposes. TERAPIX data volumes of 100 Gbyte per night and well-specified science requirements governing data quality will provide a challenge to pipeline design and execution. We need to meet that challenge (as we have with the HST archive on a smaller scale) as a step towards our goal of providing fully-calibrated data for all of our projects (e.g., Gemini).
The frames that are the end product of the image processing pipelines of TERAPIX will be further processed to provide catalogues of photometry, astrometry, morphology, and other derived parameters. These catalogues form the input for catalogue-level data mining tools which will provide the capability to explore the contents of the TERAPIX dataset and must also be capable of intercomparisons with other datasets from a variety of sources.
Pixel-level data mining is the capability to access, re-process, and re-analyse the MEGACAM images for a specific scientific goal. Data archives exist because of the belief that new ways of looking at astronomical pixel data, and new ways of combining those pixel data with other data, will yield new scientific insights. We need to provide the capability for data mining on the pixel content of our resident archives as well as a cross-archive functionality (for example, retrieval of data from another site, combination with our resident data and joint re-analysis of the combined data).
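To make the idea of combining pixel data concrete, here is a minimal sketch of two steps mentioned in this proposal: reducing one image to the scale of another and forming a colour map from the pair. It assumes, for simplicity, that the two images are already aligned and differ in pixel scale by an integer factor; real archive data would need astrometric registration and proper resampling. The function names and the use of a flux ratio as the "colour" are choices made for this example only.

```python
def rebin(image, factor):
    """Block-average a 2-D image (a list of equal-length rows) by an integer factor."""
    n_rows, n_cols = len(image), len(image[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("image dimensions must be multiples of factor")
    out = []
    for r in range(0, n_rows, factor):
        row = []
        for c in range(0, n_cols, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def colour_map(band_a, band_b, floor=1e-12):
    """Per-pixel flux ratio band_a / band_b.

    Pixels where band_b is at or below `floor` are returned as None
    rather than dividing by (near) zero.
    """
    return [[a / b if b > floor else None for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(band_a, band_b)]


# Made-up pixel values: a 4x4 image at fine scale and a 2x2 image at coarse scale.
fine = [[1, 1, 2, 2],
        [1, 1, 2, 2],
        [3, 3, 4, 4],
        [3, 3, 4, 4]]
coarse = [[2.0, 4.0],
          [0.0, 8.0]]

common = rebin(fine, 2)            # bring the fine image to the coarse scale
ratio = colour_map(common, coarse)
```

Block averaging preserves mean surface brightness rather than total counts; which is appropriate depends on the units of the archived frames.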
The specific goals that we propose to achieve by the year 2003 are:
Although TERAPIX is central to our plans, our emphasis is on developing generic tools which will be capable of operating on any dataset. The capabilities we propose will prove extremely useful during the design and planning phases of observational programs on Gemini and other telescopes.
During the next few years CADC needs to play a role in adopting internationally-accepted protocols for querying astronomical information services and returning query results. These need to be in place to enable the powerful distributed queries that are so important for exploiting archive and other resources.
In addition to human resources, we require substantial new storage and computing hardware and software over the three years of the proposal to deal effectively with very large data volumes. Present-day CPU capabilities are sufficient for first-stage processing of TERAPIX data if the inherent opportunities for parallel processing are exploited. However, serious Data Mining applications will require an increase above existing CPU power. Large amounts of magnetic disk space are needed and very fast networks are also required. CADC access to CA*net II, combined with a fast CPU purchased with a special NRC Vice Presidential allocation, has enabled us to begin working on the TERAPIX project.
Our hardware requirements in the final year of the development project depend upon the outcome of the CFI proposal to develop a Computational Collaboratory, of which CADC TERAPIX is one component. We have estimated requirements assuming that some resources will be provided by the CFI proposal for large-scale Data Mining work.
This proposal provides a framework for estimating the costs of developing new services which we believe are essential to maintain the CADC in a leading role among international data archiving facilities. We believe these new services are needed by Canadian research astronomers to maintain a competitive posture.
The costs can be broken down into costs which HIA is willing and able to bear as part of its support of CADC and those costs which need to be met by new funds. The HIA contribution amounts to $0.4 million and includes salary of existing staff and some of the indirect costs of supporting new staff. The 3.5 FTE (2 astronomers and 1.5 Computer Scientists) for 3 years with benefits and travel amount to $0.95 million and we require new hardware and software at a cost of $0.55 million over 3 years. Thus we require new funding at a level of $1.5 million over 3 years.
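The figures above can be checked with a short worked sum. The last line is an inference rather than a number stated in the proposal: it assumes the HIA contribution covers the same three-year period and is additional to the new funds.

```latex
\begin{align*}
\text{new staff} &= 2\ \text{astronomers} + 1.5\ \text{computer scientists} = 3.5\ \text{FTE} \\
\text{new funding} &= \$0.95\,\text{M (staff)} + \$0.55\,\text{M (hardware and software)} = \$1.50\,\text{M} \\
\text{total effort} &= \$1.50\,\text{M (new)} + \$0.40\,\text{M (HIA)} = \$1.90\,\text{M}
\end{align*}
```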
Although commercial data mining software may play a role in our system (just as the SYBASE commercial database software already does) we believe that we need to design and write many of the tools ourselves. There are several reasons for this. There exist many established astronomical data analysis algorithms which we intend to import into the high-volume Data Mining environment. These tend to be specialised and would not be available in commercial software. Our data volumes and image formats differ from those of most commercial data mining applications, and astronomical data frequently have signal-to-noise ratios very different from those of other (e.g., medical or satellite) imaging applications. On the other hand, statistical analysis and other operations on catalogue data might be dealt with using commercial packages. We envision a system that takes all of these factors into account.
In contrast to prevailing views of Data Mining, we do not necessarily see machine-learning algorithms (e.g., memory-based reasoning or artificial neural networks) as being a core part of our initial system, although they may very well become increasingly interesting as we develop faster and broader access to larger astronomical datasets. We see our first problem in Data Mining as the implementation of established astronomical analysis tools within the context of a high-volume processing environment. Commercial packages will not take us in that direction. Furthermore, at the core of all successful computer-based astronomical information services (data archives, NED, VizieR) is one essential element: the continual injection of scientific knowledge and expertise into the tools and services. This can only come from astronomers.
We have strong interest from the private sector in marketing part of our existing data archiving and distribution systems and we will continue to be sensitive to such industrial opportunities for the data mining system that we propose to develop.