CSIRO Computing History, Chapter 5
This page/post is under construction.
Last updated: 14 Aug 2020.
Robert C. Bell
Chapter 5. Joint Supercomputing Facility – the Cray Research era
- 1990-1992: Joint Supercomputing Facility; Cray Y-MP2/216 SN 1409 (cherax I), Leading Edge Technologies at Port Melbourne.
By early 1989, it was probably clear to all that CSIRO science research was not going to be well served by PAXUS, given its decision to move to all-IBM-compatible host services. Pressure for a replacement system and service for the Cyber 205 and CSIRONET probably mounted from the Divisions with large computing needs (e.g. Atmospheric Research). The CSIRO Policy Committee on Computing (PCC) managed the policy on CSIRO computing facilities. In 1989, it set up the Supercomputing Facilities Task Force (SFTF) to decide on follow-on facilities from the Cyber 205.
The Task Force was set up in early 1989 and was chaired by Mike Coulthard from CSIRO Applied Geomechanics. Other members of the team included Charles Johnson from Materials Science and Engineering (name at the time?), Bob Smart from the Division of Information Technology, Robert Bell from the Division of Atmospheric Research, possibly Mike Kesteven from Radiophysics, and Peter Price from the ACSG. CSIRO issued a call for expressions of interest, asking for a replacement for the 205 that had to be faster, run UNIX and have 10 Gbyte of disc storage, along with a few other criteria.
Benchmarks were developed to allow vendors to demonstrate performance.
Proposals were received from CDC (ETA), Convex, Cray Research (one for an X-MP, one for a Y-MP in conjunction with Leading Edge Technologies), IBM and Fujitsu.
Proposals from Cray and Convex were short-listed, and Bob Smart and Rob Bell left on 29th July 1989 to visit Purdue University, CDC, Cray Research, Convex and the National Center for Atmospheric Research, to carry out benchmarking, and to gather information.
At Purdue, Bob and Rob evaluated an ETA-10 running UNIX. After about half the benchmarks didn’t even compile, it was clear that, although the ETA-10 might have been an attractive follow-on from the 205, this was a lost cause, with CDC having pulled the plug on the ETA enterprise in April 1989. A visit to CDC in Minneapolis gave no further assurance.
Bob and Rob then visited Cray Research, Convex and NCAR. (Still jet-lagged after travelling on 4 flights in one day on a long Saturday, Bob and Rob arrived at Cray on the following Tuesday afternoon to find that benchmarking was scheduled from 10 PM to 2 AM!) At Convex, performance issues became evident in access to memory. There also appeared to be a fault in the floating-point arithmetic.
Two propositions were then put to the PCC by the SFTF: one a proposal from Convex for several of its C series vector machines, and the other a proposal from Cray Research and Leading Edge Technologies to share in a Cray Y-MP. (A proposal from Cray Research for a CSIRO-only Cray X-MP was not recommended.)
At a meeting in August 1989, the PCC decided on the Cray Research/Leading Edge Technologies shared Cray Y-MP proposal. (This happened while Bob Frater, the PCC chair, was absent from the meeting to attend an event at CSIRO Radiophysics, and Alan Reid took over as chair.) Minter Ellison was called on to do due diligence on LET.
See https://news.google.com/newspapers?nid=1300&dat=19891113&id=mQAzAAAAIBAJ&sjid=pJEDAAAAIBAJ&pg=5570,5349828&hl=en for a report in “The Age” from 13 November 1989 about the setting up of the JSF and the Strategic Research Foundation.
(In 1989, Rob Bell analysed the CSIRO usage of the CDC Cyber 205, and found that 91% of the usage came from researchers based in the Melbourne area. Given the state of the network bandwidths available at the time, at the dawn of AARNet, the decision was made to base the supercomputer in Melbourne.)
Then, negotiations began with LET to establish the Joint Supercomputing Facility service and the machine room to house the Y-MP. (LET already had an X-MP, and had an established seismic data processing business with Tensor Pacific.)
- The Y-MP was installed at Leading Edge Technologies’ premises at 283 Normanby Road, Port Melbourne, in February/March 1990, with acceptance testing (16 hours per day) being run in March. The system was opened by Barry Jones on the day before a Federal Election, and was named “cherax”. The name was chosen as a play on words: cherax is the scientific name for the yabby, the Australian Cray :-). We initially pronounced cherax with a starting sound like “chips”, but at the opening of the first CSIRO Cray on 23rd March 1990, Barry Jones AC FAA FACE FAHA FASSA FTSE, then Minister for Science in the Hawke government, remarked that “since the name is derived from the Greek, the pronunciation is cherax, as in chemistry” (with a hard “k” sound). We stood corrected.
- Here is a picture of the Y-MP. In the foreground is the input/output system (IOS), based on Cray X-MP technology, and in Cray’s traditional shape with ‘seats’ around a ‘circular’ cabinet. The next part is the Y-MP itself, in a rectangular cabinet. Further back still is another red cabinet, possibly the refrigerant unit, while behind that, against the wall, is an air handler. On the left is one of the front-end workstations. On the right is a row of DD-49 disc units – each unit had a capacity of 2 Gbyte, but with access times comparable with current disc drives. CSIRO had one unit for its entire on-line storage.
1990-1997: Supercomputing Support Group
The Supercomputing Support Group (SSG) was formed in 1990, with Robert Bell appointed as leader from May (initially part-time on secondment from the Division of Atmospheric Research); recruitment then added Len Makin, Marek Michalewicz and Simon McClenahan. The group was co-located with CSIRO Division of Information Technology at its Melbourne laboratory, firstly at 55 Barry St, then from June 1991 at 723 Swanston St, Carlton (with University of Melbourne, RMIT, IBM, ACCI, CITRI, etc.). The SSG reported to John O’Callaghan, Chief of the Division of Information Technology. LET provided Help Desk services, for which Eva Hatzi was appointed, as well as system and other management.
- On 14th November 1991, CSIRO initiated the Data Migration Facility (DMF) on its home filesystem on cherax, with data being written to two 3480-compatible tape cartridges (each capable of holding about 240 Mbyte). This provided a hierarchical storage management system, restoring to CSIRO what it had lost in the mid-1970s. The tapes had to be mounted by operators, but as LET had a large tape-based operation to support field data processing for the seismic industry, the extra load was probably not too great. (LET did get access to new cartridge tape drives, which CSIRO purchased.) CSIRO staff can see a history of what became known as the CSIRO Scientific Computing Data Store here.
- By 1991, things were starting to turn bad for LET. Martin Sachs had established the company in response to a suggestion from Barry Jones about how he could assist Australia in gratitude for what he had been able to achieve as a migrant. However, economic conditions soured (as Treasurer in 1990, Paul Keating famously described the 1990s recession as “the recession we had to have”), and a planned merger by LET with Boeing to host supercomputing services in Australia failed to come to fruition, possibly because the ‘offset’ arrangements allowing Australia to import Boeing planes in exchange for Boeing contributing to the Australian economy faltered.
- In early 1992, Rob Bell was asked by Greg Batchelor, General Manager, CSIRO ITSB (the equivalent of CSIRO’s CIO), what could be done to protect the CSIRO data at LET. A DMF backup procedure was started, which took a copy of CSIRO’s data offsite at monthly intervals.
- In May 1992 Cray Research (Australia) took over the running of the system to continue to provide a service to CSIRO.
- It was subsequently disclosed that the Cray Research engineers had installed a secret network link to cherax, with the equipment hidden in the underfloor, to help protect the asset.
- Cray Research stepped in, and prepared a proposal to CSIRO for CSIRO to acquire another Cray Y-MP to replace the current one, for it to be hosted at the University of Melbourne, for Cray Research and the University to provide hosting and management services, and for about 10% of the resources to be made available to University users and Cray Research clients. See the next chapter.
- LET did become bankrupt (Martin Sachs was rumoured to have lost several million dollars), and the liquidators sold off the equipment, most of which went for a song: the value was in the service around it. Rob Bell followed some of the equipment to the auction houses, and purchased about 3500 cartridge tapes at about 20c each, compared with a new price of about $7 each.
LET’s managing director was Tom Kopp. The company employed Peter Boek as general manager, Mark Watson as systems analyst, Richard Hume, Eva Hatzi as Help Desk person (who became a favourite with many of the CSIRO users), Ken Ho Le as systems administrator, and many operators.
Here’s a picture of the Y-MP mainframe with the skins off, showing the hoses carrying coolant to the processor boards and some of the wiring which provided the high-speed interconnect between the processors and the many banks of memory. Also pictured are Eva Hatzi, Judy Mercure (DIT Communications Manager) and Rob Bell.
- Funding arrangements
After a battle, funding for the CSIRO part of the JSF was agreed to be taken ‘off the top’ of the CSIRO budget. However, a write-back process was required to be done each year, when the notional cost of any usage was added to the budget of Divisions, and then taken away again. This had the purpose of reflecting in Divisional budgets the full cost of their research, including supercomputer usage. It also upset Divisions’ calculation of their external earnings ratio, since the above process reduced the earnings ratio.
CSIRO Management insisted that Divisions would have to pay for the facility, but after the agreement for off-the-top funding, this became a requirement for Divisions to pay for any enhancements or extras, such as software (!?!) or auxiliary hardware.
A Supercomputing Facilities Users Management Committee (SFUMC) was set up, consisting mainly of about 6 representatives from Divisions, and this had the responsibility to manage the funding arrangements. A share scheme was set up, whereby Divisions were asked to contribute to a Development Fund, and the shares (in the Fair-Share Scheduler on the system) were set in proportion to the contributions. (This was modelled, at the suggestion of Trevor Hales, on a scheme operating in Europe for funding shared networks.) A floor price was in effect, with the Division of Atmospheric Research agreeing to carry forward its previous spending on CSIRONET and the Cyber 205 of $10k per month, and a minimum charge of $100 per month was set. This enabled funding to be built up, and software of general use was purchased from this fund, as well as storage enhancements such as the cartridge tape drives. Changes to the scheduling were needed and these became available from Cray (the Fair-share NQS): whereas the original Fair Share scheduling dealt only with contention between running processes on a shared system, we needed the share system to influence which jobs were to be started. A final tweak to overcome the problem of small contributors being swamped by large contributors was suggested by the Institute Director, Bob Frater, based on the practice of some international bodies, where voting rights were set in proportion to the square root of the population. We set shares in proportion to the square root of the contributions and this worked satisfactorily (although no user is ever happy for his or her jobs to have to wait to run!).
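To illustrate the effect of the square-root tweak described above, here is a minimal sketch in Python. It is not the code that ran on cherax, and it does not reproduce how the Fair-Share Scheduler or NQS consumed the shares. The function names and the two Division labels are hypothetical. The only figures taken from the text are the $10k per month contribution and the $100 per month minimum charge.

```python
import math


def linear_shares(contributions):
    """Shares directly proportional to contributions (the original scheme)."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total contributions must be positive")
    return {name: amount / total for name, amount in contributions.items()}


def sqrt_shares(contributions):
    """Shares proportional to the square root of each contribution."""
    # math.sqrt raises ValueError for negative amounts, which is what we want.
    weights = {name: math.sqrt(amount) for name, amount in contributions.items()}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total contributions must be positive")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical two-Division example: one large contributor at $10k per month
# and one small contributor paying the $100 per month minimum charge.
contributions = {"large_division": 10_000, "small_division": 100}

linear = linear_shares(contributions)
rooted = sqrt_shares(contributions)

for name in contributions:
    print(f"{name}: linear {linear[name]:.3f}, square-root {rooted[name]:.3f}")
```

With these two contributors, the ratio of shares falls from 100:1 under the proportional scheme to 10:1 under the square-root scheme, so the small contributor's share rises from about 1% to about 9% of the machine.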