[HCoop-Discuss] Reorganizing, people-wise and tech-wise
Matthias-Christian Ott
ott at hcoop.net
Thu Jun 25 16:34:08 EDT 2009
On Thu, Jun 25, 2009 at 02:49:53PM -0400, Adam Chlipala wrote:
> The trouble that I just announced on hcoop-announce is only the latest
> in a series of events showing what bad shape we are in. Thankfully,
> it's not financial bad shape; all of our accounting matches up, and we
> have almost $7000 of member balances available in our bank account to
> finance some hardware purchases to be covered by dues over time. What
> follows is the vision that came to me over the course of dealing all
> this week with thousands of processes stuck forever in the 'D' state
> while accessing AFS files. Please share your responses and
> counterproposals, so that we can figure out what the heck we should be
> doing.
>
> Here is what I think we should do, based on my latest whims.
>
> First, I think the costs of AFS and distributed systems magic in general
> are outweighing the benefits. We've had these problems with AFS and
> Kerberos:
I didn't have any experience with AFS before, but my membership at HCoop
has strengthened my opinion that AFS is bloated and doesn't fit well into
Unix-like operating systems. So if nobody is able to maintain it
(megacz seemed to be the only person in the last few months who would be
able to do this), I definitely discourage HCoop from using it.
> 1. I think it's very likely that those stuck processes wouldn't be stuck
> with standard local filesystems. Extra complexity from distribution
> seems to enhance the terribleness of various situations, including
> machine overload.
> 2. We have a hell of a time getting volunteer admins who understand this
> stuff well enough to set it up from scratch. We had cclausen, who
> stormed off in a huff; and we have megacz, who (understandably) isn't
> willing to make hard time commitments. Everything else that we use
> besides Domtool is commodity stuff that half our members have
> significant experience administering, and a good portion of those
> members are willing to do it for free for a few hours a week. (A
> handful volunteered just today in our IRC channel.)
> 3. Doing all kinds of standard hosting things with Kerberos and AFS is a
> pain! We've lost a non-trivial number of members who cited this as
> their main reason for leaving. I think the ideal situation for almost
> all of our members (including me) is to have access to "a normal Linux
> machine" with some extra precautions against data loss and some
> provisions for reasonably high availability.
I was about to leave HCoop today, but I saw that people are thinking
about the problems and are willing to fix them, and that's what counts
for me.
Instead of giving up on the idea of a distributed filesystem, we should
reevaluate other possible solutions. I have experience with NFS and 9P.
NFS has kind of worked for me and seems to be widespread; 9P is
primarily used on Plan 9.
9P has the advantage of being simple, but its Linux support is not as
good as NFS's (there is a kernel client and a user-space server). I
could imagine having a Plan 9 fileserver and diskless Linux clients.
However, 9P is more of a long-term project, not something we would have
set up by the end of next month. So I suggest we look at NFS first.
Moreover, there are some cluster filesystems out there; maybe someone
knows more about them and has hands-on experience with them.
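Whatever we end up evaluating (AFS, NFS or 9P), the symptom to watch for
is exactly the one Adam describes: simple filesystem operations hanging
under load. A rough sketch of how we could probe for that during
testing (the path and the timeout below are only placeholders):

    import os
    import threading

    def probe(path, timeout=5.0):
        """Return True if stat() on path succeeds within timeout seconds."""
        result = []
        def worker():
            try:
                os.stat(path)
                result.append(True)
            except OSError:
                result.append(False)
        t = threading.Thread(target=worker)
        t.daemon = True   # don't let a hung stat() keep the probe alive
        t.start()
        t.join(timeout)
        return bool(result) and result[0]

    if __name__ == "__main__":
        # /afs/hcoop.net is only an example; point it at the mount under test
        print(probe("/afs/hcoop.net"))

Running something like this regularly against each candidate would give
us comparable numbers instead of anecdotes.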
> Thus, what I'm suggesting now is that each member be assigned to a
> particular machine, and every daemon that has anything to do with him
> should live on that machine. We would add new machines for new members
> each time we run out of resources on the old machines. I think it still
> makes sense to use LDAP for global accounts and passwords, but with
> local home directories. We should have a separate main server that does
> DNS and forwards mail for @hcoop.net addresses to appropriate machines,
> based on usernames. All of our meta-services (like our various co-op
> web sites) should live there, too. Finally, we ideally want at least
> one spare machine sitting around, booted up and tested regularly for
> hardware failures. (At $50/U/mo. for colocation, I think the extra
> expense for this is worth it.)
I vote against separate machines; my experience at TIP9UG showed that
it's quite valuable to have a shared filesystem.
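That said, keeping global accounts and passwords in LDAP sounds right to
me either way. For what it's worth, the per-user @hcoop.net mail routing
Adam describes could be generated directly from LDAP; here is a rough
sketch with python-ldap, where the server URI, base DN and the attribute
holding a member's assigned machine are all just assumptions on my part:

    import ldap  # python-ldap

    # These values are assumptions for illustration only.
    LDAP_URI = "ldap://ldap.hcoop.net"
    BASE_DN = "dc=hcoop,dc=net"
    HOST_ATTR = "host"  # attribute holding the member's assigned machine

    def forwarding_map():
        """Build a {username: machine} map for @hcoop.net mail routing."""
        conn = ldap.initialize(LDAP_URI)
        conn.simple_bind_s()  # anonymous bind; a real setup would authenticate
        entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                                "(objectClass=posixAccount)",
                                ["uid", HOST_ATTR])
        table = {}
        for dn, attrs in entries:
            if "uid" in attrs and HOST_ATTR in attrs:
                table[attrs["uid"][0]] = attrs[HOST_ATTR][0]
        return table

    if __name__ == "__main__":
        for user, machine in sorted(forwarding_map().items()):
            print("%s@hcoop.net -> %s" % (user, machine))

The output could then be turned into whatever table the MTA on the main
server expects.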
> Here are the steps I can think of for getting from here to there.
>
> 1. Form a committee of at least 3 members who are responsible for all
> hardware purchases (and ideally most of whom aren't volunteering for
> anything else). They should determine what we should buy and who we
> should buy it from.
> 2. This committee should find either 2 or 3 fairly beefy 1U servers for
> us to buy and colocate at a new provider with rates more in line with
> the average.
Peer 1 wants 75 USD per 1U; is that correct?
> 3. We figure out which provider this should be and get those servers to
> them. We shouldn't end up needing to pay more than $200/mo. for
> colocation, which comes out to less than $1 per member.
> 4. Come up with a set of at least 4 (and probably not more, either)
> volunteer admins, with clear requirements on how much time they devote
> to HCoop and when. A few people on IRC miraculously offered to be "on
> call," without any prompting from me. At a minimum, we should have
> scheduled "check-in points" several times a day, when someone with root
> access is always scheduled to make sure everything is working and take
> action to fix things that turn out to be broken. We should use one of
> the standard monitoring tools to make this easier.
Maybe we should choose a location for the servers near someone reliable
(someone who has been an HCoop member for a long time, say), so that if
some hardware has to be fixed, he can go to the hosting facility in
person.
We should also choose a hosting company that offers IPv6.
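On the monitoring point in step 4: besides a standard tool (Nagios or
Munin, for example), a tiny check for processes stuck in the 'D' state
could flag the kind of problem we had this week before a whole machine
bogs down. A rough sketch (the alerting part is left out):

    import os

    def stuck_processes():
        """Return (pid, name) pairs for processes in the 'D' state."""
        stuck = []
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/status" % pid) as f:
                    fields = dict(line.split(":", 1)
                                  for line in f if ":" in line)
            except IOError:  # the process exited while we were reading
                continue
            if fields.get("State", "").strip().startswith("D"):
                stuck.append((int(pid), fields.get("Name", "").strip()))
        return stuck

    if __name__ == "__main__":
        for pid, name in stuck_processes():
            print("PID %d (%s) is in uninterruptible sleep" % (pid, name))

Run every few minutes and mailed to the admin on duty, something like
this could already cover the gaps between the scheduled check-in points.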
> 5. These admins divide up the work to set up the servers as outlined
> above, documenting everything in our wiki. We have the main server, the
> member server, and (optionally) the spare server. I expect that we can
> buy a beefy enough member server that we can handle the current load
> just fine (there are used machines available for under $1000 that have
> more capacity than all 5 of our currently-on machines put together),
> though we would want to start planning immediately for adding new member
> servers when needed.
But we will keep the current servers, am I right? Or are we going to
sell at least some of them?
> 6. We have another "migration period" like when we moved from
> InterServer to Peer 1. Members have some amount of time to get all
> their stuff set up on the new systems. This should be a pain, but still
> a lot easier than last time, because _this_ time we'd be moving from a
> complicated, nonstandard architecture to the normal architecture that
> most of us have set up on our PCs.
> 7. We run regular load monitoring on all of our servers, watching for
> when it's appropriate for the hardware committee to find us a new
> machine to buy and install as an additional member server. (Repeat this
> step indefinitely.)
>
> Things to consider for the future, once the basics are working: Opt-in
> AFS access. A dedicated database server with hot-spare mirroring of all
> updates to a standby server.
>
> So, what does everyone think?
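Regarding step 7: the load watching can start out very simple and be
folded into whatever monitoring tool we choose later. A sketch, where
the threshold is completely arbitrary:

    import os

    # Warn when the 15-minute load average exceeds the number of online
    # CPUs; the factor below is an arbitrary threshold for this sketch.
    FACTOR = 1.0

    def overloaded():
        load15 = os.getloadavg()[2]
        ncpus = os.sysconf("SC_NPROCESSORS_ONLN")
        return load15 > FACTOR * ncpus, load15, ncpus

    if __name__ == "__main__":
        over, load15, ncpus = overloaded()
        print("15-minute load %.2f on %d CPUs" % (load15, ncpus))
        if over:
            print("time to think about an additional member server")

Logging these numbers over a few weeks would also give the hardware
committee something concrete to plan purchases with.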
Regards,
Matthias-Christian