Keeping Up with the Growth of "Moose" Computing Demand at UVM

CIT first offered an "open system" (Unix) host in 1991. The first system, an IBM RS6000 530, was acquired using an IBM Matching Equipment Grant and was initially selected to serve the compute-intensive needs of students, faculty, and staff. Unix (AIX in this case) was chosen because competition in this market had led, and was leading, to tremendous improvements in price/performance ratios -- especially in comparison to the existing VAX and IBM mainframe environments. This choice also allowed UVM to take advantage of the wealth of free or low-cost software available for Unix systems. It should be noted that UVM's Division of Engineering, Mathematics and Business Administration Computing Facility (EMBA-CF) had already pioneered Unix as a viable operating system for meeting academic needs.

This system (dubbed "Moose" by the staff members who first installed it) proved to be very popular with students and faculty, especially as an email and Internet access host. When funding for the VAX host dedicated to geographic information systems (GIS) expired shortly thereafter, Moose also became the new home for GIS research and instruction. To keep up with the growing demand, CIT steadily increased the power of Moose: the 530 was replaced with a 580 in 1993, which was upgraded to a 590 in 1994, followed by the purchase of an R24 in January 1995. (See attached chart for relative capacity.) During this period the number of users increased from a few hundred to well over 10,000 and was projected to grow by several thousand more by fall (currently about 14,000). By the spring of 1995 it became clear that, though the R24 was the most powerful RS6000 uniprocessor available at the time, it was not going to keep up with the anticipated growth in demand. Since upgrading to a more powerful processor was not an option at the time, several alternatives were considered. A few of the more viable possibilities that were evaluated, but not chosen, were:

A. An IBM RS6000 4-way SP2 system was seriously considered in the spring of 1995. Although the power and scalability of this system would likely have met our needs for several years, the cost was far more than we could afford. While we might have been able to come up with the funds for the base system (2-4 processors), the cost of upgrades and the price/performance ratio were out of line with the market.

B. Building separate systems for separate communities, or dividing users among hosts alphabetically. Reasons for non-selection were:
(1) Separate systems would require additional system management and account administration resources.
(2) Account automation programs would have to be extensively modified to handle multiple hosts.
(3) Unused resources on a lightly loaded system would not be available to users on a heavily loaded system.
(4) The IBM technical representatives, systems programmers, and academic computing staff, whose efforts are critical to the success of any such major project, felt that there had to be a better, more scalable way to meet this need.

C. Use the Network File System (NFS) and the Network Information Service (NIS), along with IBM LoadLeveler software, to cluster the R24 with one or more other RS6000s. The NIS/NFS combination had been used in the EMBA-CF to meet the needs of their students and faculty for some time. (A sketch of this approach appears after this list of alternatives.) Reasons for non-selection were:
(1) There were serious concerns about security exposures introduced by NIS.
(2) NFS performance had been disappointing.
(3) The support staff were seeking a better technology, one that could meet existing and future needs for security, scalability, and a client-server architecture with the potential to unify our computing network.

D. Build a separate DCE cluster, independent of Moose, and migrate users one by one. Reasons for non-selection were:
(1) We did not have sufficient budgetary funds.
(2) The migration from Moose to Zoo would have been much more difficult and time-consuming for our users and for CIT, because files would have to be moved explicitly.
(3) In the event of DCE problems, users who had converted would not have been able to fall back easily to a non-DCE Moose, since their old files would not be updated automatically (as they are with the current DCE implementation).
(4) The need for additional power for Moose users was pressing.
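
To make the NIS/NFS division of labor in alternative C concrete, the following is a minimal, purely illustrative Python sketch (not anything CIT deployed): NIS distributes one shared account map to every cluster member, while NFS makes one server's home directories visible at the same path on every member. All host names, accounts, and paths here are hypothetical.

    # Toy stand-in for an NIS passwd map: one copy of the account data,
    # served to every cluster member, so all hosts agree on identities.
    nis_passwd = {
        "jdoe": {"uid": 1001, "home": "/home/jdoe"},
    }

    # Toy stand-in for each member's NFS mount table: every host mounts
    # the same export at /home, so the same files appear everywhere.
    nfs_mounts = {
        "moose": {"/home": "fileserver:/export/home"},
        "gnu":   {"/home": "fileserver:/export/home"},
    }

    def login_view(host, user):
        """Describe what a login on `host` sees for `user`."""
        account = nis_passwd[user]            # resolved via the shared map
        backing = nfs_mounts[host]["/home"]   # same export on every host
        return "%s on %s: uid=%d, home %s (served from %s)" % (
            user, host, account["uid"], account["home"], backing)

    # A login on any member sees the identical account and files --
    # the single-image property this option would have provided.
    for h in sorted(nfs_mounts):
        print(login_view(h, "jdoe"))

The sketch captures only the shared-namespace idea; it says nothing about the NIS security exposures or NFS performance problems that led to the option's rejection.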


Some of the things CIT has done to respond to the growing load:

• Implemented a high-speed fiber-optic ring (FDDI) to connect hosts and a stand-alone backup tape library (1995; necessary to cope with fewer operators due to budget reductions and to provide better service to researchers)

• Connected two 8-way RS6000 J30 systems to the FDDI ring to form the Zoo cluster

• Installed DCE 2.1 when it became available in October 1995, and attempted a massive cutover on November 15, 1995, which, despite considerable testing by IBM and UVM, proved unacceptably unreliable under heavy load in our environment due to software flaws in IBM's DCE implementation. There ensued a period of intensive effort by IBM and UVM technical staff to identify the problems and put together an alternative plan for migrating Moose users to the DCE environment.

• Encouraged folks to use POP/IMAP desktop email systems such as Eudora and PC-PINE (see the sketch following this list)

• Converted all our lab stations to encourage the use of PC-PINE and "MailDrop"

• Convinced IBM to loan 512MB of memory for Moose (at no cost to UVM); to fly in an expert from the Austin, TX AIX support center to work onsite; and to bring together various IBM managers and DCE experts for daily conference calls until the most serious problems were solved (Jan-March)

• Installed two new, separate SMTP servers to improve performance, increase reliability, and lighten the load on Moose

• Moved Listproc to a lightly loaded system (aliased to LIST) to improve list performance

• Moved Bootp to another host to improve Bootp performance and lighten the load on Moose

• Converted large email lists (UVMTODAY, I-Teach, et al.) to newsgroups

• Purchased a GIS (ArcInfo) site license so that it could be run on hosts other than Moose

• Set up a separate DCE security server to improve Zoo performance and reliability

• Upgraded the Moose R24 to an 8-way SMP R30
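
As an aside on the POP/IMAP item above: the benefit of desktop clients such as Eudora and PC-PINE is that messages are pulled down and read on the desktop instead of on the shared host. The following minimal Python sketch shows the basic POP3 exchange such a client performs, using the standard poplib module; the server name and credentials are placeholders, not UVM's actual configuration.

    import poplib

    # Connect to a (hypothetical) POP3 server and authenticate.
    conn = poplib.POP3("pop.example.edu")
    conn.user("jdoe")
    conn.pass_("not-a-real-password")

    # LIST reports the messages waiting in the server-side mailbox.
    num_messages = len(conn.list()[1])

    # RETR pulls each message down to the desktop, so reading and
    # composing mail consumes desktop cycles rather than host cycles.
    for i in range(1, num_messages + 1):
        response, lines, octets = conn.retr(i)

    print("Downloaded %d messages" % num_messages)
    conn.quit()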

At this point the performance of Moose/Gnu/Zoo is excellent. CPU utilization during the day is typically 10-20% (compared to 100% earlier in the semester). Compared to January, when the daytime load average was frequently over 200, the "load factor" has improved by a factor of nearly 100 (i.e., the load average now typically sits in the low single digits). We expect this load to increase as more folks, especially researchers and others with compute-intensive needs, "rediscover" the system. Email that sometimes took up to a day to arrive now typically arrives in a few minutes or even seconds. Because this is still a very new system, we expect to see more service interruptions than with a mature, stable, and obsolescent technology such as UVMVAX or UVMVM. Nonetheless, reliability has already improved markedly, and we will continue to work with the vendor to further enhance the reliability and availability of the system.

Future Efforts (with estimated schedule)
• Continue careful migration to the Zoo cluster; focus efforts on improving scheduled uptime (now)

• Continue to work daily with the IBM support team in Austin to solve any remaining DCE bugs (now)

• Add Elk (another 8-way processor) to the DCE cluster (after commencement)

• Implement load balancing so that folks always get the system with the lightest load; see the sketch at the end of this list (summer 96)

• Add a RAID disk array to Moose to improve reliability, availability, and performance (after commencement)

• Convert all 24 processors from 601 chip technology to substantially faster RS6000 604 technology when it becomes available from IBM at no cost (summer 96)

• Add memory to GNU (on order) and ELK (summer 96?)

• Add additional processors as budget resources allow (holiday break? summer 97?)

• Set up a separate server for managing backups (currently on Moose)

• Possibly set up a separate server for www.uvm.edu (summer 96? never?)
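
A note on the load-balancing item above: the idea is simply that new logins and connections should be directed to whichever cluster member currently reports the lightest load. The following minimal Python sketch shows the selection step only; the host names and load figures are hypothetical, and a real implementation would gather live load averages from the hosts rather than use a fixed table.

    def lightest_host(load_averages):
        """Return the host reporting the lowest load average."""
        return min(load_averages, key=load_averages.get)

    # Hypothetical snapshot of one-minute load averages on the cluster.
    loads = {"moose": 3.2, "gnu": 1.1, "elk": 0.4}

    # New sessions would be pointed at the winner -- here, "elk".
    print(lightest_host(loads))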