Sidebar 6: Robert Bell personal computing history

By February 8th, 2021

These pages attempt to give some of the history of CSIRO’s use of computing in its research, focussing mainly on the large shared systems and services.

Last updated: 25 Feb 2021.
Robert C. Bell

Introduction

In 2003, I gave a talk at a Dinner hosted by RMIT Engineering, and talked about my career in terms of Technology: Mathematics, Meteorology, Machines.

Early life

I was born in Hobart in 1949, and lived there until September 1951 when our family (Father Powell, Mother Gwenyth, brothers Alan and John, and sister Margaret) moved to Melbourne to the suburb of Bentleigh.  My family had originally come from NSW – my Mother from Terara near Nowra, and my Father from Sydney.

My Father was the Accountant at the main Melbourne branch of the Commonwealth Savings Bank at 8 Elizabeth St, and was promoted to Inspector before retiring in 1973.  My Mother was never in paid employment, working on the family dairy farm until she was married.  But there was a history of scholarship in her family, with one Uncle being a school inspector, and another a civil engineer who was Chief Engineer of the NSW Railways and a president of the Institute of Engineers.

Maths and School

Maths and numbers seemed to be around us.  Prior to my going to school, I could tell you the number of each of the animal cards in my collection from the cereal packets.  I attended Bentleigh West primary school, and then Brighton High School.  In grade 4, after parent/teacher interviews with Mr Curlis, I was encouraged to learn more about Maths, and my parents bought an additional textbook.

While at primary school, we acquired an old typewriter from my Uncle Albert’s business (C. Bell & Son).  My Father did the monthly accounts for the business, and he was rewarded by ‘presents’ from Uncle Albert.  I set about typing up a multiplication table.  I don’t remember how many rows it had, but think it went to at least 20.  I typed up successive pages to the right as the table was extended, and I do remember that I reached 60.  Pages were sticky-taped together, and rolled up like a scroll.

I won a Mothers’ Club scholarship from Bentleigh West.

I started High school in 1961, and soon became interested in the weather through the Geography syllabus.  I set up my own weather station at home in September 1961, and started taking twice-daily observations, which continued until I left home in 1974 (the records are on paper at home).  Here is an article I wrote which was published in the school magazine Voyager in 1962.
I continued to be good at Maths.  In year 11 and 12, I benefitted from having Mrs Frietag as Maths teacher.  She had a laconic style, and I remember her handing back our exam papers one time and reluctantly saying that she couldn’t find anything wrong with my effort!  In the year 12 mid-year exams, in one of the Maths subjects, she failed nearly everyone – I think she had used an old end-year exam paper, and we hadn’t covered all the topics yet.  There were protests, and the results were scaled so more students passed.  I was a bit annoyed that she arbitrarily gave me a score of 95 when the scaling-up produced a result for me around 110 out of 100.

At the end of year 11 in 1965, I was called in by the Principal to encourage me to attend a summer Maths camp at Somers.  He said that I could win an exhibition.  I attended the camp, and subsequently did gain a general exhibition, awarded for the top 30 students in the state.  I also won a state Senior Scholarship, which provided money!

My brother Alan joined the Department of Defence in early 1960 after completing a science degree at the University of Melbourne.  (I found out years later that he had done a subject on computing, and had used CSIRAC.)  In September 1961, he left home to work for a year at GCHQ in Cheltenham, UK. We (and especially me) at the time had no idea of what he worked on.  We knew he was in Defence Signals.  Part way through his year for a few weeks, his aerogramme letters started coming from Bletchley Park instead of Cheltenham, and on his way home in November/December 1962, we received letters from Silver Springs, Maryland.

In about 1963-1964, he started to teach me about computers, with a ‘model’ showing pigeon holes as places where numbers and instructions could be stored.

In 1965, I attended an ISCF Science Camp (Inter-School Christian Fellowship) at Belgrave Heights; the camp was run by the Graduates Fellowship, including my brother Alan.  We were taught the beginnings of Fortran programming, and he took the programs we wrote to get them punched onto cards, and compiled and ran the programs (on a CDC 3400 at DSD that was under acceptance testing I found out later).  Thus I wrote my first computer program in 1965: the program calculated the period of a pendulum for various lengths using the well known formula.
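The 1965 program itself is not reproduced here. As a minimal sketch of the same calculation, assuming the "well known formula" is the small-angle period T = 2π√(L/g), the following Python tabulates the period for a few lengths. The function name, the chosen lengths and the use of standard gravity are illustrative assumptions, not details from the original Fortran.

```python
import math

STANDARD_GRAVITY = 9.80665  # m/s^2, assumed value for g


def pendulum_period(length_m, g=STANDARD_GRAVITY):
    """Small-angle period in seconds of a simple pendulum of the given length."""
    return 2.0 * math.pi * math.sqrt(length_m / g)


# Tabulate the period for several lengths, as the original program did
# "for various lengths" (the lengths chosen here are arbitrary).
for length in (0.25, 0.5, 1.0, 2.0):
    print(f"{length:5.2f} m  {pendulum_period(length):6.3f} s")
```

Because the period depends on the square root of the length, quadrupling the length doubles the period.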

In 1966, my sister Margaret commenced work with the Bureau of Meteorology as a programmer in training, having completed a BSc(Hons) majoring in Applied Maths at Monash University.

University and Vacation Employment

I was keen on a career in Meteorology, and enrolled in a BSc at Monash University, commencing in 1967.  During 1967, I wrote to CSIRO Division of Meteorological Physics at Aspendale, and was subsequently employed as a vacation student for 1967-68.  I worked for Reg Clarke. I spent the first hour of each day punching in data from the Wangara Expedition onto cards, and the subsequent hours with a large mechanical desk calculator (the sort with ten rows of ten buttons), and log book and pencil and paper, calculating u.cos(theta) + v.sin(theta) from more Wangara observations.  It’s great to see that data from Wangara is now publicly accessible through ARDC – I thought it might be lost!  https://researchdata.edu.au/wangara-experiment-boundary-layer/681872
(However, not all the data was published, and I suspect the raw data is gone.)

I also learnt where the off switch was in the radar tower, as a safety measure to cover the absence of other staff.  Andreij Berson was studying dry cold fronts, which showed up on radar for unknown reasons: I can remember his scanning an approaching dry cold front with binoculars and shouting “birds!”. See https://publications.csiro.au/publications/publication/PIprocite:43a9d45f-bc88-4d4b-af4c-801773391cff/BTauthor/BVberson,%20f.%20a./RP1/RS25/RORECENT/RFAuthor=berson%2C%20f.%20a./STsearch-by-keyword/LIBRO/RI7/RT38

I also started reading from the library at Aspendale A Guide to FORTRAN Programming 1961 by Daniel D. McCracken, and tried writing a simple program to be run on the DCR CDC 3200 at Clayton – the program didn’t work, because I didn’t understand array dimensioning.
Fortunately in 1968, I studied numerical analysis and Fortran programming at Monash, with John O. Murphy lecturing, and gained more understanding.

The next year, I was again successful in gaining vacation employment at Aspendale.  This time, Reg Clarke assigned me to write a program to model a sea breeze, using the equations derived in a paper (Estoque, M. A. A theoretical investigation of the sea breeze https://doi.org/10.1002/qj.49708737203), and Reg’s own boundary layer parameterisations.  I tried to make progress, but had no idea of programmability, and foolishly programmed everything with explicit constants in a premature attempt for speed.  Despite a few attempts later in 1969 to get the program working, it never did.  I did generate successful routines to solve some of the parameterisations.

In 1969, only one of my units at Monash involved computing – Astrophysics.

I applied to CSIRO to work at Aspendale in the next vacation, but was instead offered a position at the Commonwealth Meteorology Research Centre – a joint CSIRO/Bureau of Meteorology centre.  I commenced there working for Doug Gauntlett in the IOOF building in McKenzie St Melbourne.  Doug asked me to investigate solvers for Helmholtz equations, which were required at each time-step of the current weather forecasting models.  In particular, he asked me to investigate the alternating-direction implicit (ADI) method, which was developed by the nuclear research community.  I built a framework to test various algorithms, including successive over-relaxation, a method developed by Ross Maine, and ADI.  ADI proved to be the fastest in the cases under investigation, where a good first guess was available, such as in a time-stepping model where the field from the previous time-step was likely to be close to the required solution for the current time step, and absolute accuracy was not needed.  (It turned out that a fellow student, David Bover, on the same vacation was working on ADI for the ARL.)  During this vacation, I used the Bureau’s IBM 360/65 systems, and so learnt some JCL.

I did no computing during my honours year, which probably helped, and graduated with first class honours in Applied Maths.

I again worked at CMRC in 1970-71, and wrote a report on the work: Comparisons between explicit and semi-implicit time differencing schemes for simple atmospheric models

Post-graduate

I commenced a PhD in Applied Mathematics at Monash University in early 1971, with Roger K. G. Smith as supervisor.  I wanted to do something meteorological with computing, but Roger suggested doing work on quasi-geostrophic models of ocean circulation.  Quasi-geostrophic equations were an earlier (successful) simplification of the equations governing the flow of the atmosphere, when the earth’s rotation dominated the forces acting on the atmosphere.  I was not happy about oceans rather than atmosphere, but started the work, and did build a successful model, which did show the main features of large-scale oceanic flow.  I used the Monash CDC 3200 for this work.  Unfortunately for me, Roger Smith left for the University of Edinburgh, and I did not follow.  Bruce Morton took over my supervision, and suggested looking at stromatolites in Shark Bay WA, but I did not get far with this.  Roger Smith returned to Monash, and I wrote a more accurate quasi-geostrophic model which I ran on the Monash Burroughs 6700.  But the lure of computation distracted me from the PhD.

I did learn a lot about computing, and was appointed as post-graduate representative on the Monash Computing Advisory Committee, at a time when replacements for the CDC 3200 were being considered.  There were proposals from CDC, Burroughs, IBM, UNIVAC and one other (Honeywell?), but it became clear that the Computing Centre was committed to buying the Burroughs 6700 as a successor to the Burroughs 5500.  One of the Applied Maths professors likened it to a university library that, on discovering that it could earn money from lending romantic novels to the community, threw out all the science journals and texts and bought more novels!  The Burroughs machines acted as backups for the Hospital computing services.

CSIRO Aspendale

I accepted a 3-year appointment at Aspendale, starting in late 1974, and there commenced developing a model of airflow in support of the Latrobe Valley Study.  I eventually did finish my PhD, but was not making much progress with the airflow model.  People like Peter Manins tried to help me in my research career, but I think I was too proud to accept advice.  In 1977, the Chief, Brian Tucker, offered me an indefinite appointment as an Experimental Officer, which I was grateful to accept – a position allowing me to support researchers by doing computation for them, and I had found my niche.

ITCE

In October 1976, the Division hosted the International Turbulence Comparison Experiment (ITCE) at Conargo, NSW, one of the flattest areas on earth!  Three weeks before the start, I was asked to help with the computing side, working with Neil Bacon and Graham Rutter.  This involved writing programs to deal with the acquisition and calibration of the data, and was to be run on an HP 21MX mini-computer.  I spent about 10 days in Deniliquin/Conargo helping to set up the computing services in a caravan.
         PROGRAM WA
C-----------------------------------------------------------------------
C
C        WA IS THE FIRST OF THREE PROGRAMS TO ANALYSE I. T. C. E. CORE
C        DATE FROM MAGNETIC TAPE
C        THE MAIN STAGES OF WA ARE:
C        1. SETUP AND INITIALIZATION
C        2. INPUT OF FUNCTION SPECIFICATIONS FROM PAPER TAPE FROM THE
C           21MX.
C        3. INPUT OF SPECTRA  SPECIFICATIONS FROM PAPER TAPE FROM THE
C           21MX.
C        4. PROCESSING OF BLOCKS OF DATA FROM MAGNETIC TAPE. THIS STAGE
C           CONSISTS OF –
C           A. INPUT FROM MAGNETIC TAPE.
C           B. CONVERSION TO VOLTAGES.
C           C. SELECTING THE CORRECT SUBROUTINE FOR CALIBRATION.
C           D. COLLECTING SUMS FOR AVERAGING, ETC.
C           E. OUTPUTTING REQUIRED CALIBRATED DATA TO DISC FOR SPECTRA.
C        5. CALCULATION AND PRINTING OF AVERAGES, ETC.
C        6. OUTPUT OF CONTROLLING DATA AND AVERAGES, ETC. FOR WB AND WC.
C
C        NOTE. THROUGHOUT THIS PROGRAM, THE WORDS FUNCTION AND
C              SUBROUTINE ARE BOTH USED TO DESCRIBE THE EXPERIMENTER-
C              SUPPLIED SUBROUTINES.
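One of the surprises to me was when I ran a program to calculate means and variances from the data, to find that I had negative variances!  I used a well-known formula for variances which allowed a single pass through the data:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$$
instead of the mathematically equivalent:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$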
The 32-bit floating-point arithmetic on the HP-21MX did not have enough precision to avoid the catastrophic cancellation to which the first formula is prone.  I later researched summation algorithms (Kahan and others), and developed block algorithms which provided high accuracy for the calculation of means and variances in a single pass (unpublished).
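The effect can be reproduced on a modern machine. The sketch below rounds every intermediate result to 32 bits and compares the one-pass sum-of-squares formula (mean of the squares minus the square of the mean) with Welford's running update, a standard stable single-pass method. Two assumptions apply: IEEE 754 single precision stands in for the HP 21MX floating-point format, which differed in detail, and Welford's method is used as a stand-in because the unpublished block algorithms are not described here. All names and data are illustrative.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def variance_sum_of_squares(data):
    """One-pass formula: mean of squares minus square of mean, in 32-bit steps."""
    n = len(data)
    total = 0.0
    total_sq = 0.0
    for x in data:
        x = f32(x)
        total = f32(total + x)
        total_sq = f32(total_sq + f32(x * x))
    mean = f32(total / n)
    return f32(f32(total_sq / n) - f32(mean * mean))


def variance_welford(data):
    """One-pass Welford update (a stand-in, not the author's block algorithm)."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        x = f32(x)
        n += 1
        delta = f32(x - mean)
        mean = f32(mean + f32(delta / n))
        m2 = f32(m2 + f32(delta * f32(x - mean)))
    return f32(m2 / n)


def variance_two_pass_double(data):
    """Reference value: two passes in ordinary double precision."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


# Small fluctuations about a large mean, the situation that provokes cancellation.
samples = [10000.0, 10000.25, 10000.5, 10000.75] * 100

print("reference      :", variance_two_pass_double(samples))
print("sum of squares :", variance_sum_of_squares(samples))
print("Welford        :", variance_welford(samples))
```

With values near 10,000 the squares are near 10^8, where adjacent single-precision numbers are 8 apart, so the sum-of-squares result can only be a multiple of 8 (zero or negative included), while the true variance is about 0.078.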

Collaborations

When I returned from ITCE, I found Rory Thompson sitting at my desk.  I worked with Rory Thompson (my worst time in CSIRO – I feared him, for good reason as we found out later), Angus McEwan, Allan Plumb (I programmed the first successful model of the Quasi-Biennial Oscillation for Allan), Peter Webster, Peter Baines and latterly Jorgen Frederiksen, with whom I had a productive partnership over several years during the time he won the David Rivett medal.   He kindly made me joint author on several papers.

UK Met Office visit

In 1983-84, I visited the UK Meteorological Office for a period of six months to gain early experience with a Cyber 205, and to begin the porting of CSIRO codes to it.  More details are given here.  See also Csironet News no. 178, August 1984 – Cyber 205 experiences – R. Bell.

Jorgen Frederiksen

One of the projects with Jorgen involved trying to improve code that he had that looked at atmospheric stability – fastest growing modes, blocking, etc.  I found that over 90% of the run time was in setting up interaction coefficients, and less than 10% of the time was spent in solving the eigenvalue problem.  Furthermore, I found that the interaction coefficients could be calculated separately, and once only, and saved.  This led to a huge speed-up, and allowed much larger problems to be tackled.

Another problem involved computing ensembles, and I was able to vectorise the code for the Cyber 205 over the ensemble members, to get great speed-up.

DCR/Csironet interactions

During these years, I tried to take advantage of every useful facility that DCR/Csironet provided to support the scientific effort.  I used and promoted the use of the source code control system UPDATE, I could write Ed box programs, I promoted the use of standard Fortran, I built libraries of useful code (a set of routines for solving block tri-diagonal systems, used in the QBO work, and by Peter Webster) and wrote utilities to help manage data holdings.  I had two stints working in the User Assistance Section in Canberra.  I started writing an anonymous column for Csironet News (Stings and Things by Scorpio.)

DAR Computing Group

In about 1986, the Chief asked me to consider taking on the role of Computing Group Leader, which I had done on a temporary basis in June-August 1985.  I accepted the position, and started in March 1987.  Tony Eccleston joined the group as well, with the existing staff of Graham Rutter, Jill Walker and Hartmut Rabich.  Staff issues dominated, as we sought to establish a new UNIX-based local computing environment.  After going out to tender, running benchmarks, and evaluating proposals, Silicon Graphics won over Sun and HP (and maybe others) with a clear performance advantage.  A UNIX server was installed for general computing use.

SFTF

With the privatisation of Csironet underway, and no clear path for a successor to the Cyber 205 for scientific computing work, in 1989 the CSIRO Policy Committee on Computing set up the Supercomputing Facilities Task Force (SFTF), to decide on follow-on facilities from the Cyber 205.  See Chapter 5.
I was heavily involved and managed the benchmarks that were assembled from codes from several CSIRO Divisions, along with some specific benchmarks to test key areas such as memory performance.  I travelled with Bob Smart to the USA for two weeks to undertake benchmarking and to explore options.   This was our first visit to the USA.

When decision-time came in August 1989 at the PCC, my Chief, Brian Tucker, insisted that I should be present along with Mike Coulthard, who chaired the SFTF.  The PCC decided on the Cray Research/Leading Edge Technologies shared Cray Y-MP proposal.

JSF and SSG

I was then heavily involved in setting up the partnership (Joint Supercomputing Facility) with LET in Port Melbourne, establishing the service, and had sole responsibility for running the acceptance tests in March 1990 – 16 hours per day re-running the benchmarks for about a week on cherax, the name we gave the system (SN1409) and subsequent platforms.  I was not present all the time, but relied on Cray Research staff to start the benchmarks at 8 AM each day, and terminate them at midnight.

I continued to help with the setting up of the service, on one occasion accompanying 3 staff from Aspendale to visit LET with a magnetic tape to set up their programs, prior to acceptable networking facilities being set up by Bob Smart.
The position of Supercomputing Support Group leader was advertised, to be based at the Division of Information Technology at 55 Barry St Carlton, and I was successful in gaining the job, starting (initially for 3 days per week on secondment from DAR) in May 1990.  I had by then relinquished the Computing Group Leader position at Aspendale, to concentrate on the establishment of the Joint Supercomputing Facility.  I was joined by Marek Michalewicz, Simon McClenahan, and Len Makin to form the group of four.

In the second half of 1990 I was involved (with Peter Boek from LET and Peter Grimes of Cray Research) on a roadshow to all the major CSIRO sites (all capitals, and Townsville) to publicise the new service.  The uptake was good in several Divisions of CSIRO, but those with computing needs which could be met with existing PCs, workstations and Divisional facilities (including mini-supercomputers), did not make great use of the JSF.

At the end of 1990, I presented the paper Benchmarking to Buy at the Third Australian Supercomputer Conference in Melbourne, based on our experiences.

CUG and DMF

In April-May 1991, I was fortunate to be able to attend my first Cray User Group meeting – in London, and then visit several other supercomputing sites, including the UK Met Office, ECMWF, NCSA, NCAR and SDSC.  At CUG, I had fruitful meetings with Charles Grassl and others, as I presented results from the benchmarking of the memory subsystems of various computers.   These results illustrated the large memory bandwidth of the Cray Research vector systems of the time, compared with cache-based systems.  I also learnt about Cray Research’s Data Migration Facility, which would become pivotal in CSIRO’s subsequent scientific computing storage services.

I later served two terms on the CUG Board of Directors as Asia/Pacific Representative, and presented two papers: “Seven Years and Seven Lessons with DMF”, and a joint paper with Guy Robinson comparing the Cray and NEC vector systems (Cray was marketing the NEC SX-6 as the Cray SX-6 at the time).

DMF

We quickly found that the Cray Y-MP turned a compute problem into a data storage problem – the original system had 1 Gbyte of disc storage (DD-49s) for the CSIRO home area, and the only option for more storage was manually mounted 9-track magnetic tapes.  LET wished to acquire cartridge tape drives for its seismic data processing business, and CSIRO assisted in a joint purchase of such drives from StorageTek.  This set up minimal requirements to invoke DMF on the CSIRO /home area, which was done on 14th November 1991, so that more dormant files would be copied to two tapes, and subsequently have their data removed from disc, but able to be restored from tape when referenced.  This took some getting used to for users, but in the end the illusion of near-infinite storage capacity was compelling, and skilled users learnt how to drive pipelines of recall and process.  Thus, I had (unwittingly at the time) re-created the DAD Document Region functionality on the CDC 3600, with automatic migration to tape, and recall when required.

CSF

At the end of 1991, economic circumstances put LET under threat – see Chapter 5.  DMF allowed us to institute an off-site backup regime, just in case.  Cray Research put a proposal to CSIRO to establish a new service, in conjunction with and situated at the University of Melbourne, with a Cray Research Y-MP 3/464, and service started there on 1st August 1992, with the data being transferred from the previous Y-MP using the DMF off-site backup.  This commenced what we called the CSIRO Supercomputing Facility (CSF).

Cost write-back, the Share Scheme and the Development Fund: STK Tape library.

Back in 1990, funding for the Supercomputing Facility was constrained, and senior management was keen to have the costs attributed to Divisions.  Two mechanisms were put in place.  One, called the write-back, was applied at the end of each financial year.  The total costs of the facility were apportioned to Divisions based on their usage, an extra appropriation amount equal to the cost was given to each Division (from the Institute Funds for the Supercomputing Facility), and then taken away from Divisions as expenditure.  This achieved the costs of the facility being attributed to Divisions, but changed (for the worse) Divisions’ ratio of external earnings to appropriation funds, thus making it harder to meet the target (which was about 30% at this time).

The second scheme was called the Share Scheme.  The idea came from a report by Trevor Hales of DIT of a funding mechanism used for a European network, where each contributor received a share of the resources proportional to their contribution.  I set up a share scheme, inviting Divisions to contribute monthly, with a minimum contribution of $100 and a ‘floor-price’ from the Division of Atmospheric Research which contributed $10,000 per month (re-directing its spending on Csironet to this share scheme).  The contributions went into a Development Fund, which was used to buy items to enhance the facility, e.g. commercial software, tape drives, and, in June 1993, a StorageTek Automatic Tape Library holding up to 6000 tape cartridges.  We set shares in the Fair Share Scheduler on the Crays for the CSIRO Divisions proportional to the contributions.  Later, the batch scheduler was enhanced to consider the shares when deciding which jobs to start.  There was a problem with Divisions with small needs and contributions getting access, but this was solved following a suggestion from the Institute Director Bob Frater, who reported that some international bodies set voting rights for countries proportional to the square root of the population.  This was implemented, to allow reasonable access for Divisions with low shares.
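The effect of the square-root rule can be sketched with the two contribution levels mentioned above ($100 and $10,000 per month). This is an illustration only: the function and Division names are hypothetical, and the actual configuration of the Fair Share Scheduler is not shown in this account.

```python
import math


def scheduler_shares(contributions, square_root=True):
    """Return each contributor's fraction of the machine.

    With square_root=True the weight is the square root of the contribution,
    which narrows the gap between large and small contributors.
    """
    weights = {
        name: math.sqrt(amount) if square_root else float(amount)
        for name, amount in contributions.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical Divisions at the minimum and 'floor-price' contribution levels.
monthly = {"small_division": 100, "large_division": 10_000}

linear = scheduler_shares(monthly, square_root=False)
damped = scheduler_shares(monthly)

for name in monthly:
    print(f"{name:15s} linear {linear[name]:.3f}  square-root {damped[name]:.3f}")
```

Under strictly proportional shares the small contributor would receive about 1% of the machine; under the square-root rule the ratio of shares falls from 100:1 to 10:1, giving it about 9%.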

CSF Collaboration

The CSF seemed to work: CSIRO provided the bulk of the funding and support staff, Cray Research managed the maintenance, and provided a systems administrator (Peter Edwards) and a Help Desk person (Eva Hatzi from LET).  The University of Melbourne hosted the system and provided operators for two shifts per weekday (and maybe some on weekends), etc.  There were regular meetings between the parties, made easier by the fact that my brother Alan headed the University’s computing services at the time.  A utilisation of 98.5% was achieved over the life of the system, with the utilisation being boosted after the installation of the tape library – my analysis showed that the automation paid for itself in reduced idle time over a year or so.

Utilities – the tardir family

In March 1992, as users were starting to exercise DMF on the /home filesystem on cherax, it was apparent that recalling many little files took a long time (especially with manual tape mounts) and over-loaded the system.  I started a set of utilities, tardir, untardir and gettardir, to allow users to consolidate the contents of a directory into a tar (“Tape ARchive”) file on disc, which would be likely to be migrated to tape, but also save a listing of the directory contents in a smaller file which would be more likely to stay on-line, as very small files were not being removed from the disc.   This provided order of magnitude speedups for some workflows, and allowed users to scan the contents of an off-line file before requesting recall.  The untardir reversed the process, while gettardir allowed selective recalls.  The tardir utilities remain in use today (2021), particularly in the “external backups” procedures developed by CSIRO Scientific Computing.
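The idea behind the utilities can be sketched in a few lines of Python. This is not the tardir source, whose implementation and file naming are not shown in this account; the function names, the `.tar` and `.list` suffixes and the listing format are all assumptions made for illustration. The sketch also leaves the original directory in place.

```python
import os
import tarfile


def tardir_sketch(directory):
    """Bundle a directory into one tar file and write a small listing beside it.

    The large archive is the file an HSM would tend to migrate to tape; the
    small listing is likely to stay on disc, so the contents can be inspected
    without recalling the archive.
    """
    directory = os.path.abspath(directory)
    archive = directory + ".tar"    # hypothetical naming convention
    listing = directory + ".list"   # hypothetical naming convention
    with tarfile.open(archive, "w") as tar:
        tar.add(directory, arcname=os.path.basename(directory))
    with tarfile.open(archive, "r") as tar, open(listing, "w") as out:
        for member in tar.getmembers():
            out.write(f"{member.size:>12d} {member.name}\n")
    return archive, listing


def gettardir_sketch(archive, member_name, destination):
    """Selectively extract one regular file from the archive into destination."""
    with tarfile.open(archive, "r") as tar:
        source = tar.extractfile(tar.getmember(member_name))
        target = os.path.join(destination, os.path.basename(member_name))
        with open(target, "wb") as out:
            out.write(source.read())
    return target
```

A user would consult the listing first, and only then trigger a recall of the archive for the files actually needed.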

America’s Cup

Around 1993-95, the CSF with Cray Research hosted development work on cherax by the designer of the America’s Cup syndicate.  The designer, who had to be based in Australia, was offered time on Sun systems, but insisted on access to a Cray system.  With the money that came from this, a fourth processor was acquired, worth about $A250k.

Bureau of Meteorology

The Bureau had also acquired a Cray Y-MP.  In about 1996, the incoming CSIRO CEO, Malcolm McIntosh, reportedly asked, “What are we doing about supercomputing: I’m prepared to sign off on a joint facility with the Bureau.”  This was enough to get the management and support staff of both organisations working together to bring this about.   The technical team drew up specifications for a joint system, and went to tender: three companies responded: Fujitsu, NEC and Cray Research.  One of the contentious parts was that I specified Fortran90-compliant compilers for the CSIRO benchmarks, and the Cray T90 outperformed the NEC SX-4 on these tests, but the Bureau didn’t specify Fortran90-compliance, and the NEC bid was better on the Bureau’s tests.  Software quality was always difficult to measure, and the things we could measure came to dominate the evaluation, as often happens.  In the end, NEC won the contract.  (Some years later, a Cray Research employee noted that we had dodged a bullet with the T90 – it was unreliable.  I remember a colleague from CEA France, Claude Lecouvre, reporting seeing Cray engineers in full PPE in CEA’s machine room, diagnosing an uncontrolled leak of fluorinert, which released poisonous gases if over-heated.)
In parallel with the tender evaluation, work was underway to draw up an agreement between CSIRO and the Bureau, which became the HPCCC (High Performance Computing and Communications Centre) allowing for the Bureau to be the owner of the shared equipment, for the Bureau to host the joint support staff on its premises, and for auxiliary systems to be co-located.  Steve Munro from the Bureau was the initial manager, and I was appointed deputy manager (although I couldn’t act as manager, as I did not have Bureau financial delegations).
Staff moved into newly fitted-out premises on the 24th Floor of the existing Bureau Head Office at 150 Lonsdale St Melbourne in September 1997, with 8 staff members initially.
The SX-4 arrived in September 1997, and was installed in the Bureau’s Central Computing Facility (CCF) on the first floor, requiring some tricky crane-work.
Although the HPCCC awarded the contract to NEC, there were two aspects of its proposal that were considered deficient, and NEC agreed to undertake developments to cover these aspects: scheduling and data management.   Rob Thurling of the Bureau and I drew up specifications for enhancements.

The first problem was the lack of a ‘political’ fair-share scheduler.  The HPCCC needed the system to respond rapidly to operational work, but allow background work to fill the machine, and also to ensure that each party received its 50% share of resources.  NEC set to work and wrote the Advanced Resource Scheduler (ARS), but after John O’Callaghan pointed out what the abbreviation ARS led to, the name was changed to Enhanced Resource Scheduler (ERS).   An early version was available by the end of 1997, and this grew into a product which was later enhanced by NEC to support multi-node operation for the SX-6, allowing for preemption by high priority jobs, with checkpointing, migration to other nodes and restart for lower priority work.  Other NEC SX sites used the product.  There were over a hundred tunable parameters, and NEC continued to enhance the product to meet our suggestions through the life of the systems.  (Jeroen van den Muyzenberg wrote one addition to implement a request from me.  CSIRO liked to over-commit its nodes that weren’t running multi-CPU or multi-node jobs with single-CPU jobs, to maximise utilisation – otherwise, idle CPU time would accumulate when jobs were doing i/o for example.  The addition was to tweak the process priorities for jobs (about every 5 minutes), giving higher priority to the jobs which were proportionally closest to their finishing time, and giving lower priority to jobs just starting.  This resulted in jobs starting slowly, but accelerating as they neared completion.)  The HPCCC ran ERS on the NEC SX systems until their end in 2010.
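The priority tweak for over-committed nodes can be illustrated with a small sketch. It assumes that "proportionally closest to finishing" means the fraction of a job's requested CPU time already consumed, and that priorities are expressed as Unix-style nice values (lower is more favoured). Both are assumptions; the names and the 0 to 19 range are hypothetical and are not taken from ERS.

```python
def nice_values(jobs, best=0, worst=19):
    """Map each job to a nice value from the fraction of its CPU limit used.

    jobs maps a job name to (cpu_seconds_used, cpu_seconds_requested).
    A job that has just started gets the worst (highest) nice value; a job
    close to its requested time gets a value near the best (lowest).
    """
    result = {}
    for name, (used, requested) in jobs.items():
        fraction = min(max(used / requested, 0.0), 1.0)
        result[name] = round(worst - fraction * (worst - best))
    return result


# Three hypothetical single-CPU jobs sharing an over-committed node.
running = {
    "just_started": (0, 3600),
    "half_way": (1800, 3600),
    "nearly_done": (3500, 3600),
}

# In the scheme described above, this recalculation would be repeated
# periodically (about every 5 minutes) as the jobs accumulate CPU time.
print(nice_values(running))
```

Re-running the calculation as CPU time accumulates reproduces the behaviour described: a job starts slowly and speeds up as it nears completion.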

The second problem was data management.  Both CSIRO and the Bureau were running DMF on Cray Research systems – a J90 for the Bureau.  NEC proposed the SX-Backstore product as a replacement to provide an integrated compute and data solution.  There followed a development process by NEC to meet the specifications that we gave for a workable production HSM.

However, when testing was undertaken on site, a serious issue arose.  One of the key requirements for a HSM is protection of the data, including restoration of all the files and information in the event of a crash and loss of the underlying filesystem (there was such a crash around that time on CSIRO’s Cray J916se system, with recovery being provided by the CSIRO systems administrator at the time, Virginia Norling, and taking 30 hours for millions of files).  Ann Eblen set up a test file system on the SX-4 with about 30,000 files managed by SX-Backstore, took a dump to disc (about 5 minutes) and to tape (about 6 minutes), wiped the disc, and then set SX-Backstore to restore the filesystem.  This took 46 hours, a totally unacceptable time – it looked like there was an n-squared dependency in the restore process.   NEC found that a complete re-engineering would be needed to solve the problem, and the HPCCC agreed to accept from NEC compensation for the failure to deliver.

The Bureau had by this stage moved from an Epoch to a SAM-FS HSM, while CSIRO continued with DMF on a Cray J916se, which was acquired in September 1997 and installed in the Bureau’s CCF as an associated facility.  This system was acquired at my insistence.  The J916 had a HiPPI connection to the NEC SX-4, giving far higher bandwidth than the Bureau provided for its system with just Ethernet.

 
