[HCoop-Discuss] Reorganizing, people-wise and tech-wise
Adam Chlipala
adamc at hcoop.net
Thu Jun 25 14:49:53 EDT 2009
The trouble that I just announced on hcoop-announce is only the latest
in a series of events showing what bad shape we are in. Thankfully,
it's not financial bad shape; all of our accounting matches up, and we
have almost $7000 of member balances available in our bank account to
finance some hardware purchases to be covered by dues over time. What
follows is the vision that came to me over the course of a week spent
dealing with thousands of processes stuck forever in the 'D'
(uninterruptible sleep) state
while accessing AFS files. Please share your responses and
counterproposals, so that we can figure out what the heck we should be
doing.
Here is what I think we should do, based on my latest whims.
First, I think the costs of AFS and distributed systems magic in general
are outweighing the benefits. We've had these problems with AFS and
Kerberos:
1. I think it's very likely that those stuck processes wouldn't be stuck
with standard local filesystems. The extra complexity of a distributed
filesystem seems to amplify the terribleness of various situations,
including machine overload. (A quick diagnostic sketch for spotting
these stuck processes follows this list.)
2. We have a hell of a time getting volunteer admins who understand this
stuff well enough to set it up from scratch. We had cclausen, who
stormed off in a huff; and we have megacz, who (understandably) isn't
willing to make hard time commitments. Everything else that we use
besides Domtool is commodity stuff that half our members have
significant experience administering, and a good portion of those
members are willing to do it for free for a few hours a week. (A
handful volunteered just today in our IRC channel.)
3. Doing all kinds of standard hosting things with Kerberos and AFS is a
pain! We've lost a non-trivial number of members who cited this as
their main reason for leaving. I think the ideal situation for almost
all of our members (including me) is to have access to "a normal Linux
machine" with some extra precautions against data loss and some
provisions for reasonably high availability.
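As an aside, here's a rough sketch, assuming the standard Linux /proc
layout, of how an admin could count the processes stuck in 'D' state
during one of these incidents. It's a diagnostic aid only, not part of
the proposal:

    # Count processes in uninterruptible sleep ('D') by reading /proc.
    # Purely a diagnostic sketch; relies on the standard Linux /proc layout.
    import os

    def d_state_pids():
        stuck = []
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            try:
                with open("/proc/%s/stat" % entry) as f:
                    data = f.read()
            except OSError:
                continue  # the process exited while we were looking
            # The state field is the first field after the ')' that closes
            # the command name (which may itself contain spaces).
            state = data.rsplit(")", 1)[1].split()[0]
            if state == "D":
                stuck.append(int(entry))
        return stuck

    if __name__ == "__main__":
        pids = d_state_pids()
        print("%d processes in D state: %s" % (len(pids), pids))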
Thus, what I'm suggesting now is that each member be assigned to a
particular machine, and every daemon that has anything to do with him
should live on that machine. We would add new machines for new members
each time we run out of resources on the old machines. I think it still
makes sense to use LDAP for global accounts and passwords, but with
local home directories. We should have a separate main server that does
DNS and forwards mail for @hcoop.net addresses to appropriate machines,
based on usernames. All of our meta-services (like our various co-op
web sites) should live there, too. Finally, we ideally want at least
one spare machine sitting around, booted up and tested regularly for
hardware failures. (At $50/U/mo. for colocation, I think the extra
expense for this is worth it.)
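To make the mail-forwarding part concrete, here's a minimal sketch of
the routing decision the main server would make. The assignment file,
its path, and the host names are made up for illustration; in practice
this would probably be a lookup table fed to the MTA rather than a
standalone script:

    # Sketch: route user@hcoop.net to the member server that user lives on.
    # The assignment file and host names below are hypothetical examples.
    ASSIGNMENTS = "/etc/hcoop/member-servers"  # lines like: adamc members1.hcoop.net

    def load_assignments(path=ASSIGNMENTS):
        routes = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                user, server = line.split()
                routes[user] = server
        return routes

    def route(address, routes):
        """Return the member server that should receive mail for an
        @hcoop.net address, or None for unknown users or other domains."""
        user, sep, domain = address.partition("@")
        if not sep or domain != "hcoop.net":
            return None
        return routes.get(user)

    if __name__ == "__main__":
        routes = load_assignments()
        print(route("adamc@hcoop.net", routes))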
Here are the steps I can think of for getting from here to there.
1. Form a committee of at least 3 members who are responsible for all
hardware purchases (and ideally most of whom aren't volunteering for
anything else). They should determine what we should buy and who we
should buy it from.
2. This committee should find either 2 or 3 fairly beefy 1U servers for
us to buy and colocate at a new provider with rates more in line with
the average.
3. We figure out which provider this should be and get those servers to
them. We shouldn't end up needing to pay more than $200/mo. for
colocation, which comes out to less than $1 per member.
4. Come up with a set of at least 4 (and probably not many more)
volunteer admins, with clear requirements on how much time they devote
to HCoop and when. A few people on IRC miraculously offered to be "on
call," without any prompting from me. At a minimum, we should have
scheduled "check-in points" several times a day, when someone with root
access is responsible for making sure everything is working and for
fixing anything that turns out to be broken. We should use one of the
standard monitoring tools to make this easier (there's a rough sketch of
what a manual check-in could look like after these steps).
5. These admins divide up the work to set up the servers as outlined
above, documenting everything in our wiki. We have the main server, the
member server, and (optionally) the spare server. I expect that we can
buy a beefy enough member server that we can handle the current load
just fine (there are used machines available for under $1000 that have
more capacity than all 5 of our currently running machines put together),
though we would want to start planning immediately for adding new member
servers when needed.
6. We have another "migration period" like when we moved from
InterServer to Peer 1. Members have some amount of time to get all
their stuff set up on the new systems. This will be a pain, but still
a lot easier than last time, because _this_ time we'd be moving from a
complicated, nonstandard architecture to the normal architecture that
most of us have set up on our PCs.
7. We run regular load monitoring on all of our servers, watching for
when it's appropriate for the hardware committee to find us a new
machine to buy and install as an additional member server. (Repeat this
step indefinitely.)
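To give a feel for steps 4 and 7, here's a rough sketch of the kind of
check-in script an on-call admin (or cron) could run until we settle on
a standard monitoring tool. The host names and ports are placeholders,
not decisions:

    # Minimal check-in sketch: verify a few services answer and report load.
    # Hosts and ports are placeholders, not our real machine names.
    import os
    import socket

    CHECKS = [
        ("members1.hcoop.net", 22),   # ssh on a member server
        ("members1.hcoop.net", 80),   # apache on a member server
        ("hcoop.net", 25),            # mail on the main server
    ]

    def port_open(host, port, timeout=5):
        try:
            sock = socket.create_connection((host, port), timeout)
            sock.close()
            return True
        except OSError:
            return False

    if __name__ == "__main__":
        for host, port in CHECKS:
            status = "ok" if port_open(host, port) else "DOWN"
            print("%s:%d %s" % (host, port, status))
        # 1/5/15-minute load averages on the machine running the check
        print("load averages: %.2f %.2f %.2f" % os.getloadavg())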
Things to consider for the future, once the basics are working: Opt-in
AFS access. A dedicated database server with hot-spare mirroring of all
updates to a standby server.
So, what does everyone think?