[Hcoop-discuss] Planning for new hosting set-up update and request for comments

Sun Mar 26 16:36:01 EST 2006

Karl Chen wrote:
> The quote request letter looks fine to me.
>
> From the gripes page there's this bullet:
> * If one machine goes down, then our services will go down, too.
>
> Is there anything not prohibitively expensive we can actually do
> about this?  
>
> DNS can be redundant, mail can be queued, but anything that needs
> files is screwed if the file server goes down...
>   
Short of clustering, which is beyond our means at this point, I don't 
think there is anything we can do about this.  What we want is to guard 
against total hardware failure (with things such as redundant power 
supplies, RAID, and a UPS, which is very nice if the facility provides 
it) and to have rapid on-site hardware support to replace failed 
hardware when outages happen.  If we do this right, the mean time 
between failures may not be much shorter than the time we would keep 
our hardware before upgrading anyway.  In any event, the great majority 
of commercial web hosts out there work the same way: if a critical 
server kicks the bucket, the sites it hosts go down until someone fixes 
it or restores from backup to a replacement machine.  Truly clustered 
hosting is particularly difficult to do with the kind of dynamic stuff 
that many of our users run, and it is much more costly and complex to 
administer.  Once we have a whole bunch of servers, say at least half a 
rack, it could make sense to keep an idle hot spare on the rack that 
could rapidly replace any failing server.  That would reduce downtime 
and also act as insurance, since the probability that at least one 
server fails grows quickly as the number of servers increases.
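To put a rough number on that last point: if each server independently 
has probability p of failing over some period, the chance that at least 
one of n servers fails is 1 - (1 - p)^n.  A quick sketch, where the 5% 
per-server figure is a purely hypothetical assumption for illustration, 
not a measured rate, and independence of failures is an idealization:

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n servers fails, assuming each
    fails independently with probability p (an idealized assumption)."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-server failure probability over some period (e.g. a year).
P_FAIL = 0.05

for n in (1, 2, 5, 10, 20):
    print(f"{n:2d} servers: {prob_any_failure(P_FAIL, n):.1%} chance at least one fails")
```

Under that assumption, going from 1 server to 20 raises the chance of at 
least one failure from 5% to roughly 64%, which is where a shared hot 
spare starts to pay for itself.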

-ntk