CSIRO Computing History, Chapter 6
Chapter 6. The CSIRO Supercomputing Facility – Cray Research and University of Melbourne: mini-supercomputers
Last updated: 30 Oct 2023. Added SSG and Help Desk staff photo.
Updated: 16 Aug 2023. Added information about life sciences conferences.
Updated: 26 Oct 2020. Added information about bulletins.
Robert C. Bell
Around 1991, economic conditions soured, a possible offset arrangement between Leading Edge Technologies (LET) and Boeing for a joint supercomputing facility fell through, and these and other events left LET in financial difficulties.
After receiving a warning from Greg Batchelor (then CSIRO’s equivalent of a CIO) asking about the safety of CSIRO data, an off-site backup facility was established, with a second copy of all CSIRO data being taken off site.
When it became clear that the Joint Supercomputer Facility and LET were not going to survive, Cray Research proposed that CSIRO acquire a Cray Y-MP4E/364 to be sited at the University of Melbourne and jointly managed by Cray Research and the University. This machine (SN 1918, the “end of the war”) came into operation on 1st August 1992, with all data being transferred from the old facility using the off-site backup tapes. LET went bankrupt, and its equipment was sold at bargain-basement prices. Here is a photo of SN 1918 in the Thomas Cherry building at the University, with Ian Robinson in attendance and Alan Bell, Director of the University’s Information Technology Services, in the distance.
The new facility was called the CSIRO Supercomputing Facility (CSF). Cray Research provided a systems administrator (Peter Edwards) and a Help Desk service (Eva Hatzi), while the University provided hosting services, including operators for two shifts per weekday and some weekend shifts to handle the manually mounted cartridge tapes used by DMF. In June 1993, a StorageTek Automated Cartridge System capable of holding 6000 cartridge tapes was brought into operation, funded by Divisions through their contributions to the Development Fund. This provided CSIRO users with an automated Hierarchical Storage Management system, giving the illusion of infinite storage capacity! A fourth processor was acquired in January 1994 with the help of money brought in by an America’s Cup syndicate.
In 1992, a bulletin facility called CSFbull was started, providing technical information to users – it was available through an on-line command, was featured in the login message, and was e-mailed to users. Here is the contents list of the first issue.
CSIRO Supercomputing Facility Bulletin 1
Robert Bell, 1992 Nov 24
Use the command csfbull to read bulletins.
1. Introductory Guide
2. Job Ordering
3. Queues
4. Quotas
5. Startup
6. Gaussian 92
7. Machine registration
8. Fifth Australian Supercomputing Conference
There were 64 issues through to November 1997, when it was replaced by an HPCbull facility for the HPCCC, which continued through 281 issues to April 2019:
CSIRO IMT Scientific Computing
For a WWW version of this HPCbull, please see https://confluence.csiro.au/display/SC/HPCbull+281
Bulletin 281. 2019 April 5
Use the command hpcbull to read bulletins. Also: hpcbull contents, hpcbull index, hpcbull nn, hpcbull -h
1. CSIRO - Decommission of /data HPC File System ($DATADIR)
2. CSIRO - Call for Early Adopters to Test New Flush File System
3. Conference - C3DIS and DSW
Here’s a brochure from the CSF era, showing pictures from some of the applications:
CSIRO.Supercomputer.1_Gflop.brochure
In October 1996, the tape drives on the system were upgraded to provide a doubling of capacity and greater throughput.
The Share Scheme
This scheme, introduced in August 1990, allowed Divisions to contribute a voluntary amount to a Development Fund. Bigger contributions were rewarded with a larger share of the system resources, and the Development Fund contributions were used to enhance the system, particularly its storage. (It was based on a model used in Europe for supporting network infrastructure.)
This scheme ensured that the system was not financed by charging for past usage (e.g. dollars per processor-hour used). Such financial arrangements had inhibited usage of previous systems, resulting in expensive capital items being under-utilised and making the case for the provision of such systems difficult to sustain. Once the capital and maintenance costs of a system are covered, there is little difference in operating costs whether the system is idle or fully utilised.
With the share scheme in place, utilisation of available time averaged 98.5% after the installation of the ACS. This is comparable with the best sites in the world.
In a later variation suggested by Dr Bob Frater, the shares were set proportional to the square root of the contribution, thus favouring smaller contributors, as the sketch below illustrates.
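As a rough illustration of the square-root rule (a minimal sketch only; the Division names and contribution amounts are hypothetical, not actual Development Fund figures):

```python
import math

def allocate_shares(contributions):
    """Shares proportional to the square root of each contribution,
    normalised so that the shares sum to 100%."""
    roots = {div: math.sqrt(amount) for div, amount in contributions.items()}
    total = sum(roots.values())
    return {div: 100.0 * root / total for div, root in roots.items()}

# Hypothetical contributions (in dollars) -- for illustration only.
contributions = {
    "Division A": 400_000,
    "Division B": 100_000,
    "Division C": 25_000,
}

for division, share in allocate_shares(contributions).items():
    print(f"{division}: {share:.1f}%")   # 57.1%, 28.6%, 14.3%
```

Under a linear rule, Division A would receive 16 times Division C’s share; under the square-root rule the ratio drops to 4, which is how the scheme favours the smaller contributors.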
Mini-supercomputers
During the 1970s and 1980s, many Divisions acquired what became known as minicomputers, from DEC, HP, Sun, SGI, etc. These supported data acquisition, visualisation and smaller computing tasks. In the late 1980s and early 1990s, some Divisions acquired what became known as mini-supercomputers.
The Australia Telescope National Facility/Radiophysics at Marsfield, NSW had a Convex C220. The Biomolecular Research Institute in Parkville, Victoria had a Convex C3210. The Division of Exploration and Mining in Brisbane had a Convex C220, incorporated into the Queensland Supercomputing Laboratories (QSL), a joint venture involving the Division and the Queensland Government. The Division of Information Technology had a Maspar in Canberra, while the Division of Exploration and Mining had a Maspar in Perth. The Division of Atmospheric Research had an 8-processor Silicon Graphics machine and later acquired a Cray J90. Here from 9 Sep 1991 is a press report on the Queensland Supercomputing Laboratories. Pertinently, QSL was seen as a competitor to Leading Edge Technologies, with the press release making favourable comparisons of the QSL and LET charging rates (for different machines – a Cray Y-MP and a Convex C220).
Ross Dungavell (private communication, 2020) added:
From memory, the SPP and the StorageTek PowderHorn arrived together; the C220s (there were two on site, ours and Digikey’s) used it via NFS.
Our C220 arrived in early 1992 when our group was still at UQ (we also had a C1 from late 1991 to 1993 or so).
I believe the C220 had the distinction of being the first machine in the country with a gigabyte of RAM. It also had FDDI networking.
We then, in 1994(?), got a Convex Meta Series, which was a cluster of 16 HP 720s with a high-speed, low-latency coax-based network whose name I forget. It was delivered twice, as the first shipment was dropped, probably from aircraft cargo-door height, and was severely damaged.
The SPP-1600/Exemplar and StorageTek were delivered in the second half of 1995 and in production in early 1996; I went to the US in December 1995 for a few weeks of training in the HFS software (which had a few problems).
By 2000, workstations/desksides outperformed the Convex gear and the C220 was scrapped. The StorageTek was transferred off-site to the state geological survey for use in its principal role as a repository of seismic data.
The rack from the SPP-1600 discs remains in use as a server rack today.
Audits
The CSIRO internal audit section conducted an audit of the CSIRO Supercomputing Facility, and then proceeded to audit the mini-supercomputer facilities in 1992.
Retrospective
Here are some notes from a retrospective on the CSF written in August 1997, just before its termination in September 1997.
The Cray Research systems have grown to be very reliable, both in hardware and software. The CSF Cray has averaged 99.98% availability in prime time, 98.6% availability in all time, and over two months between unscheduled interruptions in later years.
The systems have proved to be productive for users, with good compilers, libraries and tools. The provision of advanced scheduling tools, queuing facilities, storage management software and robust tape handling is important for smooth operations. The usefulness of the system is indicated by the average backlog of 1.5 days of work in the queues over the last two years.
The Storage Management
The provision of automated managed storage has proved to be the key productivity advantage of the system. The Cray Research Data Migration Facility (DMF) software, which provides an overflow capability for the file systems on the Cray where the users are working, has proved to be ideal for users. The data blocks of large and inactive files are moved to tape, but directory and other file information is retained on-line. When a user accesses an off-line file, or issues a command to retrieve off-line files, the system automatically restores the file data to disc. Retrievals can take less than a minute for small files.
This set-up has enabled users to tackle problems involving access to tens of Gbytes of data, and has allowed users to store data for analysis without having to be concerned with file indexing and tape management.
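The following toy model sketches the behaviour described above (an illustrative sketch only, not Cray’s DMF implementation; the class and method names are invented):

```python
class MigratedFile:
    """Toy model of a DMF-managed file: the directory entry and metadata
    stay on-line, while the data blocks of large, inactive files are
    migrated to tape and recalled automatically on access."""

    def __init__(self, name, data):
        self.name = name       # directory entry: always on-line
        self.size = len(data)  # metadata: always on-line
        self._blocks = data    # data blocks: may be migrated to tape
        self.offline = False

    def migrate(self, tape_store):
        """Move the data blocks to tape, keeping only metadata on disc."""
        tape_store[self.name] = self._blocks
        self._blocks = None
        self.offline = True

    def read(self, tape_store):
        """Reading an off-line file triggers an automatic recall."""
        if self.offline:
            self._blocks = tape_store.pop(self.name)  # recall from tape
            self.offline = False
        return self._blocks

# Usage: the user sees one namespace, whether the data is on disc or tape.
tape = {}
f = MigratedFile("model_run.dat", b"x" * 10_000_000)
f.migrate(tape)      # inactive file overflows to tape
print(f.size)        # metadata still visible while the data is off-line
data = f.read(tape)  # transparent recall on access
```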
Currently, the CSF holds two copies of 1.4 Tbyte of user data, i.e. 2.8 Tbyte in total. Most of the second copies are held off-site, for disaster-recovery purposes.
The Support
Support for users and for running the systems is vital. Under the Facilities Management agreement with Cray Research, the CSF is provided with a Help Desk person, who looks after the day-to-day operations of the system and is the principal contact for users with problems. The Help Desk person has been critical to the success of the CSF. The FM agreement also provides for a Cray Research Analyst, who has developed advanced software for scheduling and has run the system professionally, particularly in the areas of storage management and scheduling.
The Supercomputing Support Group (SSG) has provided the high-level support for users, expertise in the use of specialist software, technical management of the services, and research expertise for users.
Support has come too from the suppliers, particularly Cray Research and Storage Technology, who have gone beyond the terms of their contracts to support CSIRO, and from the University of Melbourne, which has provided a high standard of physical services and operational support, so that there have been very few interruptions to service because of environmental conditions.
Within CSIRO, Dr John O’Callaghan has provided leadership for HPC in CSIRO over the entire period of this report. Dr Bob Frater has supported the continuation of HPC facilities in CSIRO, when the need was being questioned. And since becoming CSIRO Chief Executive, Dr Malcolm McIntosh has strongly supported the establishment of a first-class facility to support CSIRO’s scientific research.
The Division of Information Technology and its successor CSIRO Mathematical and Information Sciences have provided a strong administrative base for the facilities and the SSG.
Unfinished business
There are several areas where the SSG has struggled. These include software, and the encouragement of wider use of HPC.
There is a chicken-and-egg situation with software. Many users will not use HPC systems unless applications packages are available to meet their needs. However, the higher the performance of a system, the fewer such systems there are installed, the smaller the software base, and the higher the package prices. It has been difficult to justify spending over $10k on software packages which may in the end never be used.
The SSG has worked hard to encourage the use of HPC by CSIRO scientists in disciplines which have not in the past been significant users. Two conferences were organised by the SSG on the theme of Computational Challenges in Life Sciences, but although these were well received, there was no significant flow-on to wider usage.
Summary
The CSF, and before it the JSF, have provided first-class services to CSIRO scientists who need large computing resources for their research and industrial applications. The right financial arrangements, systems, storage management and support have been vital to the productivity of users.
Some quotations from those users (addressed to the SSG or Help Desk person) conclude this report.
“You all do a great job, and (for what it’s worth) I’ve only ever heard good things said about your bunch!”
“Anyway, thank you very much for all of your help these last few weeks. I really appreciate the prompt responses to my frantic needs.”
“Thanks for such prompt help, it’s a great service you run.”
“Thanks to you and your team for all your assistance … this is the best experience with a central computer facility that I have had in my 20-year history.”
“A short note to express my appreciation of the excellent performance of the Cray YMP “cherax” computer and data migration system and the completely professional help and expertise from all staff.
In the four years I have been using the system I have never lost any data or code or even a single model run from the system. The system has run with absolute reliability.
Whenever I have needed assistance in solving coding or computer-system-related problems I have received prompt, effective help. Further, when I have accidentally deleted runs they have without fail been promptly reinstated from system backups. And when I have suffered from annoying coding problems, I have received calm, tolerant help. Simply, it’s the best quality computer system I have ever used.”
Applications profiles
Here is an extract from a report in 1993 about the usage from CSIRO Divisions, highlighting some of the applications of the time.
Appendix: Research Projects using the CSF
The following highlights the CSIRO research projects which have benefited from the use of the Cray Facility and Support Group. The list is ordered by Institute and Division. For each Division, the peak percentage monthly usage since August 1992, and (where higher) the all-time peak percentage usage is shown. A usage of 1% currently represents about 18 processor hours.
2.1 ANIMAL PRODUCTION AND PROCESSING
Animal Health
Animal Production
Food Science and Technology (10.2%)
The third-biggest using Division of the CSF in 1992-93. Work involves the study of two medically significant proteins, one of which may be the basis for an early detection test for cancer, while the other is involved in the regulation of the cardio-vascular system and may lead to better management of cardio-vascular disorders.
Human Nutrition
Tropical Animal Production
Wool Technology (<0.1%)
Interest in using the Cray to develop automatic scanning techniques for assessing wool grades.
2.2 INDUSTRIAL TECHNOLOGIES
Applied Physics (5.8%, 9.3%)
A major user of the CSF. The work has been mainly in the simulation of plasma arcs, particularly in understanding electrode behaviour, both for welding, where electrode melting is required, and for mineral processing, where minimum electrode melting is required.
Biomolecular Engineering (37.6%)
The second-biggest using Division. The use of the Cray has been critical to the ‘rational drug design’ process, which is being used to develop drugs to inhibit the spread of the influenza virus. The Cray is used to study the electronic structure of enzymes and potential drugs, while the local Convex is used to study features which affect only the drug molecule (which is typically several orders of magnitude smaller than the enzyme).
Chemicals and Polymers (3.5%, 6.2%)
The Division is a steady user of the CSF, principally to develop and study the properties of polymers and polymer blends. Work in this area won the CSIRO Chairman’s medal last year.
Manufacturing Technology (<0.1%)
This Division has started to work on using simulation techniques for understanding casting processes, and the Support Group has advised on software selection. The need for access to supercomputing resources for the Australian automotive industry is now established.
In collaboration with the Division of Materials Science and Technology, the Division is using the Cray to simulate the chemical kinetics inside the plasma in the PLASCON waste disposal system.
Materials Science and Technology (2.8%, 21.7%)
This Division has been a consistent small user of the CSF. One of the recent projects is the study of micro-channel plates for the concentration, focussing and collimation of X-rays. The Cray is used for the Monte-Carlo simulation of the properties of these devices. (See also the previous entry).
2.3 INFORMATION SCIENCE AND ENGINEERING
Australia Telescope National Facility (<0.1%)
The ATNF has not been a significant user of the CSF, mainly because of the problems of transferring large amounts of data from the telescopes to the CSF.
Information Technology (0.4%, 1.2%)
The main use of the Cray by the Division has been as a program development and measurement tool for the High Performance Computation Group. The Cray has special measurement hardware found on no other computers.
Mathematics and Statistics (15.4%)
The Division’s applications have been mainly in the area of fluid flow, and the development of the FASTFLO modelling package. Specific projects have included the modelling of landslides, and the modelling of an oil-mixing problem for Shell Australia.
Radiophysics (0.1%)
Radiophysics has used the CSF only occasionally, mainly because of the problems of transferring large amounts of data to the CSF. It has used the Cray for Monte Carlo simulations of GaAs devices.
2.4 MINERALS, ENERGY AND CONSTRUCTION
Building, Construction and Engineering (5.5%)
This Division’s applications have been in the area of air-flow modelling, particularly related to fire safety.
Coal and Energy Technology (0.7%)
The use of the UniChem software on the Cray was vital to a project on fullerenes.
Exploration and Mining (<0.1%)
Mineral and Process Engineering
Mineral Products
Petroleum Resources (0.0%, 0.4%)
2.5 NATURAL RESOURCES AND ENVIRONMENT
Atmospheric Research (75.8%, 97.7%)
The major user of the Facility. Usage is not just for Climate Change Research, but also for drought research, environmental consulting and maximum-precipitation studies. The ECRU at the Division won a CSIRO Medal last year for work which is highly dependent on the CSF.
Environmental Mechanics (<0.1%, 0.1%)
Usage is starting to grow.
Fisheries
Oceanography (3.4%, 17.6%)
Major usage has been for the oceanographic studies associated with the Climate Change Research Program.
Water Resources (0.0%, 0.5%)
Wildlife and Ecology
2.6 PLANT PRODUCTION AND PROCESSING
Entomology (0.0%, 28.3%)
Major usage was for the simulation of the storage of grain in silos.
Forest Products
Forestry
Horticulture
Plant Industry (about 8%)
The Supercomputing Support Group has worked with the Division on two projects involving genetic matching. The work was done under the Support Group’s accounts. One of the projects ceased when the person doing the work left CSIRO.
Soils (0.7%, 4.1%)
The Division is using the Cray facility to understand soil erosion processes. A predictive model at the particle-size level has been developed, which involves the solution of a set of 20 stiff partial differential equations.
Tropical Crops and Pastures (3.2%, 6.5%)
The project using the Cray involved the 3-6 month seasonal prediction of regional rainfall, based on multi-variate regression and analogues.
2.7 SUMMARY
Twenty-three out of 34 Divisions or Centres have used the Facility. Fourteen Divisions have at some stage had a peak monthly usage of over 1%, representing about 18 processor hours of the current system. CSIRO Medal-winning research has relied on the supercomputing resources available to CSIRO.
Life Sciences
The Support Group and management were concerned that although the CSF was heavily used by some research groups, the life sciences areas of CSIRO were not large users – see the above sample. To encourage usage of HPC, Marek Michalewicz organised two conferences, successfully drawing keynote speakers from overseas institutions, and publishing the proceedings in two books – “Plants to ecosystems” and “Humans to proteins”.
Staff
Here is a photograph from about 1995 with SSG staff Marek Michalewicz, Len Makin and Robert Bell, and Cray Research staff member Eva Hatzi, who managed the Help Desk service. Behind is cherax II, serial number 1918.