[HCoop-Discuss] Reorganizing, people-wise and tech-wise
Adam Chlipala
adamc at hcoop.net
Thu Jun 25 14:49:53 EDT 2009
The trouble that I just announced on hcoop-announce is only the latest
in a series of events showing what bad shape we are in. Thankfully,
it's not financial bad shape; all of our accounting matches up, and we
have almost $7000 of member balances available in our bank account to
finance some hardware purchases to be covered by dues over time. What
follows is the vision that came to me over the course of a week spent
dealing with thousands of processes stuck forever in the 'D'
(uninterruptible sleep) state
while accessing AFS files. Please share your responses and
counterproposals, so that we can figure out what the heck we should be
doing.
Here is what I think we should do, based on my latest whims.
First, I think the costs of AFS and distributed systems magic in general
are outweighing the benefits. We've had these problems with AFS and
Kerberos:
1. I think it's very likely that those stuck processes wouldn't be stuck
with standard local filesystems. The extra complexity of a distributed
filesystem seems to amplify the terribleness of various situations,
including machine overload. (A quick diagnostic sketch for spotting
these stuck processes follows this list.)
2. We have a hell of a time getting volunteer admins who understand this
stuff well enough to set it up from scratch. We had cclausen, who
stormed off in a huff; and we have megacz, who (understandably) isn't
willing to make hard time commitments. Everything else that we use
besides Domtool is commodity stuff that half our members have
significant experience administering, and a good portion of those
members are willing to do it for free for a few hours a week. (A
handful volunteered just today in our IRC channel.)
3. Doing all kinds of standard hosting things with Kerberos and AFS is a
pain! We've lost a non-trivial number of members who cited this as
their main reason for leaving. I think the ideal situation for almost
all of our members (including me) is to have access to "a normal Linux
machine" with some extra precautions against data loss and some
provisions for reasonably high availability.
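As an aside, here's a rough sketch, assuming the standard Linux /proc
layout, of how an admin could count the processes stuck in 'D' state
during one of these incidents. It's a diagnostic aid only, not part of
the proposal:

    # Count processes in uninterruptible sleep ('D') by reading /proc.
    # Purely a diagnostic sketch; relies on the standard Linux /proc layout.
    import os

    def d_state_pids():
        stuck = []
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            try:
                with open("/proc/%s/stat" % entry) as f:
                    data = f.read()
            except OSError:
                continue  # the process exited while we were looking
            # The state field is the first field after the ')' that closes
            # the command name (which may itself contain spaces).
            state = data.rsplit(")", 1)[1].split()[0]
            if state == "D":
                stuck.append(int(entry))
        return stuck

    if __name__ == "__main__":
        pids = d_state_pids()
        print("%d processes in D state: %s" % (len(pids), pids))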
Thus, what I'm suggesting now is that each member be assigned to a
particular machine, and every daemon that has anything to do with him
should live on that machine. We would add new machines for new members
each time we run out of resources on the old machines. I think it still
makes sense to use LDAP for global accounts and passwords, but with
local home directories. We should have a separate main server that does
DNS and forwards mail for @hcoop.net addresses to appropriate machines,
based on usernames. All of our meta-services (like our various co-op
web sites) should live there, too. Finally, we ideally want at least
one spare machine sitting around, booted up and tested regularly for
hardware failures. (At $50/U/mo. for colocation, I think the extra
expense for this is worth it.)
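To make the mail-forwarding part concrete, here's a minimal sketch of
the routing decision the main server would make. The assignment file,
its path, and the host names are made up for illustration; in practice
this would probably be a lookup table fed to the MTA rather than a
standalone script:

    # Sketch: route user@hcoop.net to the member server that user lives on.
    # The assignment file and host names below are hypothetical examples.
    ASSIGNMENTS = "/etc/hcoop/member-servers"  # lines like: adamc members1.hcoop.net

    def load_assignments(path=ASSIGNMENTS):
        routes = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                user, server = line.split()
                routes[user] = server
        return routes

    def route(address, routes):
        """Return the member server that should receive mail for an
        @hcoop.net address, or None for unknown users or other domains."""
        user, sep, domain = address.partition("@")
        if not sep or domain != "hcoop.net":
            return None
        return routes.get(user)

    if __name__ == "__main__":
        routes = load_assignments()
        print(route("adamc@hcoop.net", routes))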
Here are the steps I can think of for getting from here to there.
1. Form a committee of at least 3 members who are responsible for all
hardware purchases (and ideally most of whom aren't volunteering for
anything else). They should determine what we should buy and who we
should buy it from.
2. This committee should find either 2 or 3 fairly beefy 1U servers for
us to buy and colocate at a new provider with rates more in line with
the average.
3. We figure out which provider this should be and get those servers to
them. We shouldn't end up needing to pay more than $200/mo. for
colocation, which comes out to less than $1 per member.
4. Come up with a set of at least 4 (and probably not many more)
volunteer admins, with clear requirements on how much time they devote
to HCoop and when. A few people on IRC miraculously offered to be "on
call," without any prompting from me. At a minimum, we should have
scheduled "check-in points" several times a day, when someone with root
access is responsible for making sure everything is working and for
fixing anything that turns out to be broken. We should use one of the
standard monitoring tools to make this easier (there's a rough sketch of
what a manual check-in could look like after these steps).
5. These admins divide up the work to set up the servers as outlined
above, documenting everything in our wiki. We have the main server, the
member server, and (optionally) the spare server. I expect that we can
buy a beefy enough member server that we can handle the current load
just fine (there are used machines available for under $1000 that have
more capacity than all 5 of our currently running machines put together),
though we would want to start planning immediately for adding new member
servers when needed.
6. We have another "migration period" like when we moved from
InterServer to Peer 1. Members have some amount of time to get all
their stuff set up on the new systems. This will be a pain, but still
a lot easier than last time, because _this_ time we'd be moving from a
complicated, nonstandard architecture to the normal architecture that
most of us have set up on our PCs.
7. We run regular load monitoring on all of our servers, watching for
when it's appropriate for the hardware committee to find us a new
machine to buy and install as an additional member server. (Repeat this
step indefinitely.)
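To give a feel for steps 4 and 7, here's a rough sketch of the kind of
check-in script an on-call admin (or cron) could run until we settle on
a standard monitoring tool. The host names and ports are placeholders,
not decisions:

    # Minimal check-in sketch: verify a few services answer and report load.
    # Hosts and ports are placeholders, not our real machine names.
    import os
    import socket

    CHECKS = [
        ("members1.hcoop.net", 22),   # ssh on a member server
        ("members1.hcoop.net", 80),   # apache on a member server
        ("hcoop.net", 25),            # mail on the main server
    ]

    def port_open(host, port, timeout=5):
        try:
            sock = socket.create_connection((host, port), timeout)
            sock.close()
            return True
        except OSError:
            return False

    if __name__ == "__main__":
        for host, port in CHECKS:
            status = "ok" if port_open(host, port) else "DOWN"
            print("%s:%d %s" % (host, port, status))
        # 1/5/15-minute load averages on the machine running the check
        print("load averages: %.2f %.2f %.2f" % os.getloadavg())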
Things to consider for the future, once the basics are working: Opt-in
AFS access. A dedicated database server with hot-spare mirroring of all
updates to a standby server.
So, what does everyone think?