[HCoop-Discuss] Reorganizing, people-wise and tech-wise
Matthias-Christian Ott
ott at hcoop.net
Thu Jun 25 16:34:08 EDT 2009
On Thu, Jun 25, 2009 at 02:49:53PM -0400, Adam Chlipala wrote:
> The trouble that I just announced on hcoop-announce is only the latest
> in a series of events showing what bad shape we are in. Thankfully,
> it's not financial bad shape; all of our accounting matches up, and we
> have almost $7000 of member balances available in our bank account to
> finance some hardware purchases to be covered by dues over time. What
> follows is the vision that came to me over the course of dealing all
> this week with thousands of processes stuck forever in the 'D' state
> while accessing AFS files. Please share your responses and
> counterproposals, so that we can figure out what the heck we should be
> doing.
>
> Here is what I think we should do, based on my latest whims.
>
> First, I think the costs of AFS and distributed systems magic in general
> are outweighing the benefits. We've had these problems with AFS and
> Kerberos:
I didn't have any experience with AFS before, but my membership at HCoop
has strengthened my opinion that AFS is bloated and doesn't fit well into
Unix-like operating systems. So if nobody is able to maintain it
(megacz seemed to be the only person in the last few months who would be
able to do this), I definitely discourage HCoop from using it.
> 1. I think it's very likely that those stuck processes wouldn't be stuck
> with standard local filesystems. Extra complexity from distribution
> seems to enhance the terribleness of various situations, including
> machine overload.
> 2. We have a hell of a time getting volunteer admins who understand this
> stuff well enough to set it up from scratch. We had cclausen, who
> stormed off in a huff; and we have megacz, who (understandably) isn't
> willing to make hard time commitments. Everything else that we use
> besides Domtool is commodity stuff that half our members have
> significant experience administering, and a good portion of those
> members are willing to do it for free for a few hours a week. (A
> handful volunteered just today in our IRC channel.)
> 3. Doing all kinds of standard hosting things with Kerberos and AFS is a
> pain! We've lost a non-trivial number of members who cited this as
> their main reason for leaving. I think the ideal situation for almost
> all of our members (including me) is to have access to "a normal Linux
> machine" with some extra precautions against data loss and some
> provisions for reasonably high availability.
I was about to leave HCoop today, but I saw that people are thinking
about the problems and are willing to fix them, and that's what counts
for me.
Instead of giving up on the idea of a distributed filesystem, we should
reevaluate other possible solutions. I have experience with NFS and 9P.
NFS has kind of worked for me and seems to be widespread; 9P is
primarily used on Plan 9.
9P has the advantage of being simple, but its Linux support is not as
good as NFS's (there is a kernel client and a user-space server). I
could imagine having a Plan 9 fileserver and diskless Linux clients.
However, 9P is more of a long-term project, not something we would have
set up by the end of next month. So I suggest we look at NFS first.
Moreover, there are some cluster filesystems out there; maybe someone
knows more about them and has hands-on experience with them.
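Whatever we end up evaluating (AFS, NFS or 9P), the symptom to watch for
is exactly the one Adam describes: simple filesystem operations hanging
under load. A rough sketch of how we could probe for that during
testing (the path and the timeout below are only placeholders):

    import os
    import threading

    def probe(path, timeout=5.0):
        """Return True if stat() on path succeeds within timeout seconds."""
        result = []
        def worker():
            try:
                os.stat(path)
                result.append(True)
            except OSError:
                result.append(False)
        t = threading.Thread(target=worker)
        t.daemon = True   # don't let a hung stat() keep the probe alive
        t.start()
        t.join(timeout)
        return bool(result) and result[0]

    if __name__ == "__main__":
        # /afs/hcoop.net is only an example; point it at the mount under test
        print(probe("/afs/hcoop.net"))

Running something like this regularly against each candidate would give
us comparable numbers instead of anecdotes.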
> Thus, what I'm suggesting now is that each member be assigned to a
> particular machine, and every daemon that has anything to do with him
> should live on that machine. We would add new machines for new members
> each time we run out of resources on the old machines. I think it still
> makes sense to use LDAP for global accounts and passwords, but with
> local home directories. We should have a separate main server that does
> DNS and forwards mail for @hcoop.net addresses to appropriate machines,
> based on usernames. All of our meta-services (like our various co-op
> web sites) should live there, too. Finally, we ideally want at least
> one spare machine sitting around, booted up and tested regularly for
> hardware failures. (At $50/U/mo. for colocation, I think the extra
> expense for this is worth it.)
I vote against separate machines; my experience at TIP9UG showed that
it's quite valuable to have a shared filesystem.
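That said, keeping global accounts and passwords in LDAP sounds right to
me either way. For what it's worth, the per-user @hcoop.net mail routing
Adam describes could be generated directly from LDAP; here is a rough
sketch with python-ldap, where the server URI, base DN and the attribute
holding a member's assigned machine are all just assumptions on my part:

    import ldap  # python-ldap

    # These values are assumptions for illustration only.
    LDAP_URI = "ldap://ldap.hcoop.net"
    BASE_DN = "dc=hcoop,dc=net"
    HOST_ATTR = "host"  # attribute holding the member's assigned machine

    def forwarding_map():
        """Build a {username: machine} map for @hcoop.net mail routing."""
        conn = ldap.initialize(LDAP_URI)
        conn.simple_bind_s()  # anonymous bind; a real setup would authenticate
        entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                                "(objectClass=posixAccount)",
                                ["uid", HOST_ATTR])
        table = {}
        for dn, attrs in entries:
            if "uid" in attrs and HOST_ATTR in attrs:
                table[attrs["uid"][0]] = attrs[HOST_ATTR][0]
        return table

    if __name__ == "__main__":
        for user, machine in sorted(forwarding_map().items()):
            print("%s@hcoop.net -> %s" % (user, machine))

The output could then be turned into whatever table the MTA on the main
server expects.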
> Here are the steps I can think of for getting from here to there.
>
> 1. Form a committee of at least 3 members who are responsible for all
> hardware purchases (and ideally most of whom aren't volunteering for
> anything else). They should determine what we should buy and who we
> should buy it from.
> 2. This committee should find either 2 or 3 fairly beefy 1U servers for
> us to buy and colocate at a new provider with rates more in line with
> the average.
Peer 1 wants 75 USD per 1U; is that correct?
> 3. We figure out which provider this should be and get those servers to
> them. We shouldn't end up needing to pay more than $200/mo. for
> colocation, which comes out to less than $1 per member.
> 4. Come up with a set of at least 4 (and probably not more, either)
> volunteer admins, with clear requirements on how much time they devote
> to HCoop and when. A few people on IRC miraculously offered to be "on
> call," without any prompting from me. At a minimum, we should have
> scheduled "check-in points" several times a day, when someone with root
> access is always scheduled to make sure everything is working and take
> action to fix things that turn out to be broken. We should use one of
> the standard monitoring tools to make this easier.
Maybe we should choose a location for the servers near someone reliable
(someone who has been an HCoop member for a long time, say), so that if
some hardware has to be fixed, he can go to the hosting facility in
person.
We should also choose a hosting company that offers IPv6.
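On the monitoring point in step 4: besides a standard tool (Nagios or
Munin, for example), a tiny check for processes stuck in the 'D' state
could flag the kind of problem we had this week before a whole machine
bogs down. A rough sketch (the alerting part is left out):

    import os

    def stuck_processes():
        """Return (pid, name) pairs for processes in the 'D' state."""
        stuck = []
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/status" % pid) as f:
                    fields = dict(line.split(":", 1)
                                  for line in f if ":" in line)
            except IOError:  # the process exited while we were reading
                continue
            if fields.get("State", "").strip().startswith("D"):
                stuck.append((int(pid), fields.get("Name", "").strip()))
        return stuck

    if __name__ == "__main__":
        for pid, name in stuck_processes():
            print("PID %d (%s) is in uninterruptible sleep" % (pid, name))

Run every few minutes and mailed to the admin on duty, something like
this could already cover the gaps between the scheduled check-in points.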
> 5. These admins divide up the work to set up the servers as outlined
> above, documenting everything in our wiki. We have the main server, the
> member server, and (optionally) the spare server. I expect that we can
> buy a beefy enough member server that we can handle the current load
> just fine (there are used machines available for under $1000 that have
> more capacity than all 5 of our currently-on machines put together),
> though we would want to start planning immediately for adding new member
> servers when needed.
But we will keep the current servers, am I right? Or are we going to
sell at least some of them?
> 6. We have another "migration period" like when we moved from
> InterServer to Peer 1. Members have some amount of time to get all
> their stuff set up on the new systems. This should be a pain, but still
> a lot easier than last time, because _this_ time we'd be moving from a
> complicated, nonstandard architecture to the normal architecture that
> most of us have set up on our PCs.
> 7. We run regular load monitoring on all of our servers, watching for
> when it's appropriate for the hardware committee to find us a new
> machine to buy and install as an additional member server. (Repeat this
> step indefinitely.)
>
> Things to consider for the future, once the basics are working: Opt-in
> AFS access. A dedicated database server with hot-spare mirroring of all
> updates to a standby server.
>
> So, what does everyone think?
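Regarding step 7: the load watching can start out very simple and be
folded into whatever monitoring tool we choose later. A sketch, where
the threshold is completely arbitrary:

    import os

    # Warn when the 15-minute load average exceeds the number of online
    # CPUs; the factor below is an arbitrary threshold for this sketch.
    FACTOR = 1.0

    def overloaded():
        load15 = os.getloadavg()[2]
        ncpus = os.sysconf("SC_NPROCESSORS_ONLN")
        return load15 > FACTOR * ncpus, load15, ncpus

    if __name__ == "__main__":
        over, load15, ncpus = overloaded()
        print("15-minute load %.2f on %d CPUs" % (load15, ncpus))
        if over:
            print("time to think about an additional member server")

Logging these numbers over a few weeks would also give the hardware
committee something concrete to plan purchases with.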
Regards,
Matthias-Christian