High Performance Computing

Where will the conversation continue?

SU-HPC Group - su-hpc-group@lists.stanford.edu

Notes

HPC - What Is It?
=================
"When your problem outgrows what you can run on your personal system"
- Parallel systems, specialized hardware
Where are the boundaries? Very unclear; is it based on number of systems?
Methods of invocation? Number of processors?

Users towards HPC (parallelism)
===============================
This is similar to the "running on multicore" problem -> we share
problems with other domains
How much can your work be split up? Separate jobs; parallel jobs
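One way to put a number on "how much can your work be split up" is Amdahl's law, which was not mentioned in the session but is the standard back-of-the-envelope estimate. The sketch below assumes a job where a fixed fraction of the runtime parallelizes perfectly and the rest stays serial; the function name and the 95% figure are illustrative only.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Estimated speedup under Amdahl's law.

    parallel_fraction: share of the runtime that can be split up (0.0 to 1.0).
    workers: number of processors the parallel part is spread across.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part always runs at full cost; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative assumption: 95% of the job parallelizes.
    for workers in (1, 8, 64, 512):
        print(workers, round(amdahl_speedup(0.95, workers), 2))
```

The serial fraction caps the benefit: with 5% serial work, no number of processors gets past a 20x speedup. Fully separate jobs are the case where the parallel fraction is close to 1.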
Bridging the gap with users that don't know what, say, Python is; that's
hard!
Maybe there should be "Research 101" that's like Stanford 101 - shows
researchers that parallel resources exist, "Excel does not solve
everything"

SRCF
====
Stanford Research Computing Facility
Bill talked about it this morning
Shared facility - ITS, Medicine, Dean of Research, SLAC - central compute
facility for large compute clusters for HPC (3 MW); will cost $60-80M
Working on a shared model with Provost funding; there will be per-rack
costs as a colocation facility (fairly cheap)
- License issues? TBD. (Example: there are systems still in Sweet that
are only there to make software licensing happy.)
Level II - sysadmin support included, billed at the rack; the Provost
partially pays for the support
Level III - condo model. You buy into the "shared" cluster, and get
access to the resources of other people who also bought into the condo
model. There will be an initial "seed" of clusters. Linux-based, Tier 1
vendor, single interconnect type.
(Interesting discussions about GSB-ITS relationship issues.)
This will be for research computing *only*. It does not replace Forsythe.

Virtualization
==============
Can be run *on top* of a cluster.
Windows is useful for financial stuff.
Useful for running old compiled binaries, Windows, etc.
Related to cloud stuff.
Separate from enterprise virtualization.
There may be need for a separate Windows offering; there may not.
Discussions around this didn't end up with a clear answer (though both
sides were clear in their minds!)

Network Needs + Storage
=======================
InfiniBand vs. Ethernet for the network - Networking won't run IB for SRCF
What are we doing for storage? Still up in the air. Internal high-speed
network is the "normal" solution; external connections are more "normal".
Must develop in anticipation that the system will last >5 years (greater
than the lifespan of any particular technology).
Where is the data coming from? It's hard to get it from the large
network. And where is it going? This is the Big Challenge.
Have to plan ahead to buy things In Bulk to help in this area.
"InfiniBand is dead" - Ray from Networking. Err...really? Interesting
discussions.

No Independent Clusters -> Share Instead
========================================
barley cluster exists now - jumping-in point to a shared cluster
Doesn't address Windows community right now - needs to come up later

Software To Run
===============
Some codes are not converting quickly (Stata) - fair warning

Admin and Control
=================
Some users want to be able to kill/renice jobs.
In a shared environment, you assume that users run separately and the job
schedulers take care of allocation. Users can kill their own jobs;
sysadmins can kill runaway jobs; operators/managers can manage specific
groups (based on toolset)
"You get what you ask for" - if you ask for 7 days, and your job runs for
8, your job dies. Discussion around whether this is a good idea
Checkpointing of code is vital, but hard.
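To make the checkpointing point concrete, here is a minimal application-level sketch, assuming the job's state is small enough to serialize as JSON and the work is a simple loop. The function names, the state layout, and the `stop_after` parameter (which stands in for the scheduler killing the job at its time limit) are all hypothetical; real codes usually have far more state, which is a large part of why this is hard.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    os.replace is atomic on the same filesystem, so a job killed mid-write
    leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(total_steps, path, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing as it goes.

    stop_after simulates the job being killed after that many steps in this
    run; it is a testing aid, not something a real job would have.
    """
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # "killed" before finishing; in-memory state is lost
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        steps_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Resubmitting the same job with the same checkpoint path picks up from the last saved step, so only the work done since that checkpoint is repeated. The checkpoint interval is a trade-off between I/O overhead and how much work is lost when the job dies.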
Schedulers set a *policy*. Maybe we want to work around it, but it seems
dangerous...

SU-HPC Group
============
su-hpc-group@lists.stanford.edu