[HCoop-Discuss] Planning

Davor Ocelic docelic at hcoop.net
Thu Jul 16 16:56:38 EDT 2009


On Thu, 16 Jul 2009 15:37:47 -0400
Adam Chlipala <adamc at hcoop.net> wrote:

> 
> I don't think that argument is particularly valid in this case.  My 
> point is that there is a standard set of daemons that folks run on
> local Linux machines.  Many of our members already have gone through
> the process of learning to use them effectively.  Learning to use
> something new is an inherent cost, and we should avoid it unless it's
> justified by the projected pay-off.  In this case, I believe that
> barely any of our members really want a distributed filesystem.

That's true; distributed FS was not a primary concern of our users.

AFS was the choice of the HCoop sysadmin team, and my hope has been
that AFS would be easy enough to use that users would not care what
was running in the background.

Quick comment on the poll thing:

First, as I said above, I think users should not be directly involved
in the decision about which filesystem we use. Or at least, it's a
detour from the idea we've held until now: that the choice doesn't
matter as long as we take care of the administration.

And another thing, from the "distributed" perspective: besides AFS,
there's also the issue of Kerberos. The real, undeniable issue is the
inability to use SSH keys for passwordless login with Kerberos auth.
So the question is not just about a distributed FS, but about
distributed infrastructure altogether.

I suggest we move this discussion to hcoop-sysadmin for those who want
to focus on technical issues (I added the appropriate Reply-To: field
for that).

Others who want to comment on general planning, please continue this
here on the original list, hcoop-discuss at lists.hcoop.net.


Ok, now back to AFS and onto technical issues.

Overall, I think we succeeded in making AFS a non-issue for newcomers,
even though a few specific users complained repeatedly; I mostly
attributed those complaints to unfamiliarity on their part.

I find AFS enormously better than the usual Unix/Linux filesystem on
home partitions, and have trouble understanding how someone, after
considering everything, can be of the opinion that the benefits do not
outweigh the occasional shortcomings.

That said, though, we did have the following suboptimal experiences. I
list them here with all the relevant details I can think of. Comments
and pointers on any of them are appreciated.


1) We had an unusual amount of trouble getting custom daemons to run
with renewing tokens, and even now that there are no more complaints,
I suspect that many users use cron to restart their services every X
hours to work around the problem.

Specifically, the approach that sometimes did and sometimes did not
work is the "pagsh" method explained here:

  http://wiki.hcoop.net/MemberManual/RunningUnattendedCommands

Adam Megacz kindly proposed another mechanism, which he has been
advertising as the only correct way to go, and so far I haven't heard
any report claiming that it did not work:

  http://wiki.hcoop.net/RunningUnattendedCommandsWithoutRunInPagsh
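For reference, a common alternative to the pagsh recipe is Russ
Allbery's k5start, which keeps both the Kerberos ticket and the AFS
token refreshed for a long-running command. A minimal sketch, where
the keytab path, principal, and daemon command are my assumptions, not
our actual setup:

```shell
# Sketch only: run a daemon with auto-renewed credentials. The keytab
# path and daemon command are placeholders, not HCoop's actual setup.
#
#   -b      background after acquiring initial credentials
#   -U      take the client principal from the keytab
#   -t      run aklog after each kinit, refreshing the AFS token
#   -K 60   wake up every 60 minutes to renew
k5start -b -U -t -K 60 -f /etc/keytabs/user.daemon/USER \
    -- /home/USER/bin/my-daemon
```

This avoids the cron restart-every-X-hours workaround entirely, since
the credentials never expire while k5start is running.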


2) Apache has been known to "hang" from time to time, failing to serve
requests for long periods.

Things became serious when it hung 3 times in a period of 5 days.

To work around the problem, Adam Chlipala made a cron job that visits
his website (one of about 400 served from Mire.hcoop.net) every
minute; if there is no server reply within 10 seconds, it restarts
Apache.
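I don't have Adam's exact script at hand, so this is only a rough
sketch of what such a watchdog might look like (the URL and restart
command are placeholders):

```shell
#!/bin/sh
# Rough sketch of a watchdog like the one described above; the URL and
# the restart command are placeholders, not the actual cron job.
URL="http://mire.hcoop.net/"

# Fail if the server does not answer within 10 seconds.
if ! curl --silent --max-time 10 --output /dev/null "$URL"; then
    /etc/init.d/apache2 restart
fi
```

Run from cron with a `* * * * *` entry to get the once-a-minute check.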

I expected that Apache would stop serving only when it was overloaded,
and that a restart would not complete within the 1 minute until the
next check, leading us into an infinite restart cycle.

However, it turns out that the restart happens quite quickly, I only
rarely see more than one restart in a row, and since a restart always
resolves the problem, the cron job has been a great success.

This approach has the benefit of restarting Apache regardless of the
cause of the problem, but it does encourage us to restart and forget
every time instead of investigating the underlying issue.

The current situation is this: I usually receive 1 notice of an
Apache restart per day, at about the same time (roughly 4:00 AM US
Eastern).

What happens at that particular time each day remains to be
investigated (it is more or less each day; I can save the incoming
messages so we can check for sure).
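One thing worth checking (a guess on my part, not a confirmed cause)
is whether the ~4:00 AM restarts line up with the daily cron runs,
e.g. logrotate reloading Apache:

```shell
# Compare the restart time against the daily cron runs
# (Debian-style paths assumed).
grep CRON /var/log/syslog | grep -i daily

# logrotate sending Apache a reload/restart is a classic suspect:
grep -r apache /etc/logrotate.d/
```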


3) Our Exim4 mail setup is configured to save mail to users'
~/Maildir, which is a separate volume.

(Each user has the usual "USER" principal and a "USER.daemon"
principal. Users authenticate with their Kerberos password only,
while the .daemon principals are intended for "machine use"; each has
a keytab on Deleuze (the mail server) at
/etc/keytabs/user.daemon/USER, readable by root and the corresponding
user.)

So, like all other services that save files to users' directories,
Exim4 authenticates via the keytab (by running our "get-token"
script), then saves mail to the user's Maildir.

I have copied the get-token script to mire.hcoop.net:/tmp/get-token
for anyone who wants to inspect it.
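For those who don't want to fetch the file, my understanding is that a
get-token-style script boils down to roughly the following (a sketch
of the idea, not the actual script; USER is a placeholder):

```shell
#!/bin/sh
# Sketch of the idea behind get-token, not the actual script:
# obtain a Kerberos TGT for the user's .daemon principal from its
# keytab, then convert it into an AFS token. USER is a placeholder.
USER="$1"

kinit -k -t "/etc/keytabs/user.daemon/$USER" "$USER.daemon"
aklog
```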

Every now and then, writing to the Maildir fails anyway, even right
after running get-token.

I don't have details on what exactly happens or what the exact error
message is, but in that case Exim saves the mail to an off-AFS spool
location, and we have a cron script that sifts through the spool every
15 minutes or so and tries to re-deliver the mail; most of the time
this re-delivery succeeds.

I am sure this was happening with older AFS versions (before upgrading
to 1.4.7 and later); I don't know whether it is still happening.

Mike Olson may be able to provide more information on his findings
and the exact error message.


4) Even though Mire and Deleuze are on a local LAN, Mire has
occasionally lost its connection to the file server, showing "?" in
ls output and giving "Operation timed out" on fs operations.

At a certain point in time, with older AFS versions, this was
happening pretty often for users (myself included, while in a
shell session).

The problem would correct itself with no intervention after anywhere
from 5 to 35 minutes.

I always thought that the issue was per-volume, not per-host, but
Adam Megacz had some information saying that it was per-host.

The -abortthreshold 0 option mentioned on the OpenAFS lists was
applied, and an upgrade to 1.4.10 was made, in an effort to solve
this.

There have been no reports of this happening again for quite some
time, but IIRC Megacz specifically asked whether anyone had noticed
this issue, and someone replied that they had, once or twice.

Megacz, additional info on this point would be useful here if you
have any.
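For the record, the way a fileserver flag like -abortthreshold
typically gets applied is via BosConfig (a sketch; paths and the
server name are examples, not our exact configuration):

```shell
# Sketch: applying a fileserver flag. The flag goes on the fs
# instance's command line in BosConfig (on Debian,
# /etc/openafs/BosConfig), e.g.:
#
#   parm /usr/lib/openafs/fileserver -abortthreshold 0
#
# After editing, restart the fs instance on the server:
bos restart deleuze fs -localauth
```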


5) After upgrading to 1.4.10 and rebooting, our AFS performance
deteriorated in just about every possible way, the primary symptom
being Courier processes accumulating in the D (uninterruptible sleep)
state, waiting on AFS. This made mail unreadable.

Adam Chlipala "solved" the situation with a cron job restarting
Courier every minute, and we ran that way for about 10 days.

Another reboot of the AFS & mail server (Deleuze) with the same kernel
and AFS version (1.4.10) seems to have solved the problem, and I think
we're now back to the previous AFS behavior, that is -- working well.


6) General throughput of the AFS filesystem seems to be much lower
than you'd get from a direct Unix partition.

I did not measure it myself, and please correct me if I'm wrong, but
from reports on the OpenAFS list I get the impression that, on a usual
setup with disks that can do 100 MB/s directly, it is considered good
if you get 20-40 MB/s out of AFS.

And frankly, our system really looks as if its throughput is in 
that range.
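For anyone who wants to measure rather than eyeball it, a crude
sequential-write comparison can be done with dd (the paths below are
examples, not our exact volume layout):

```shell
# Crude sequential-write comparison; paths are examples only.
# Local partition: write 100 MB and sync before reporting the rate.
dd if=/dev/zero of=/tmp/dd-test bs=1M count=100 conv=fsync
rm -f /tmp/dd-test

# Then repeat against an AFS path for comparison, e.g.
# (assumed volume layout):
#   dd if=/dev/zero of=/afs/hcoop.net/user/u/us/user/dd-test \
#       bs=1M count=100 conv=fsync
```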


7) The above performance problem is compounded when we're running
our daily backup to rsync.net.

That said, I think the backup we make is incremental and encrypted,
so a lot of processing power may be getting spent there, unrelated
to AFS.


Thanks,
-doc
