>>> bind/workers:582

Date: 09 Jun 1994 16:26:21 PDT
From: brisco@hercules.rutgers.edu (Tp Brisco)
Subject: Re: Using DNS to Load Balance

> > The upshot of the changes is that BIND can run a specific
> > program to do the zone transfer.  The program should - of course -
> > make various appropriate computations, and reorder the RRs as
> > you see fit, then return an appropriate exit code to indicate
> > the relative success of the zone transfer --- voila!  You can
> > have the RRs reordered with as much frequency as you dare turn
> > down the TTLs (I've gone as low as 5 minutes with no apparent
> > ill effects).
>
> Maybe I'm missing something here, but why not use the ROUND_ROBIN feature
> to give one of a list of addresses?  It works well for load balancing
> here and doesn't require zone transfers.  e.g.
>
>    appl    IN  CNAME  sys1
>            IN  CNAME  sys2
>    sys1    IN  A      127.0.0.1
>    sys2    IN  A      127.0.0.2
>            IN  A      127.0.0.3
>
> results in sys1 getting half the load and each address on sys2 getting
> a quarter.
>
> I don't know if ROUND_ROBIN is available in all implementations.  I found
> it in named 4.9.2.
>
> Simon Hamilton

The problem that _we_ ran into was that ROUND_ROBIN works on everything
with multiple similar record types -- which isn't necessarily what we
wanted.  In particular, I've seen other people "gripe" about the NS
records getting shuffled.  Additionally, at Rutgers, we depend (*cringe* -
don't flame) on the ordering: we've got some "cluster" machines that have
"private" networks for intra-cluster communications, and we'd prefer those
A RRs not be mucked with.  (BTW: we do advertise the less preferential
addresses, since we'd prefer that those be used - but only if the more
preferential addresses are dead for some reason.)

Also, the ROUND_ROBIN approach assumes that the relative distribution of
"load" doesn't change - i.e. until that big SAS user logs onto "sys2" and
throws your statistical model out of whack.  The ``SETTRANSFER'' (my
compilation conditional) can _react_ to actual load changes - if you so
wish.  Don't get me wrong - if ROUND_ROBIN works for you, use it in good
health.

[ Anyone know what happened to the "SHUFFLE_A" code?  Was it superseded
  by ROUND_ROBIN? ]

ROUND_ROBIN does a fine statistical randomization (in fact, one of the
early proposals I put forward was a _weighted_ statistical randomization
technique).  ROUND_ROBIN did, however, have a couple of unpleasant
surprises for us.

One of the drawbacks of *ALL* RR _ordering_ mechanisms is that the *&*^%^
"sortlist" qualifier (I think it's in the resolver - but may be in both)
can really undo all the hard work we've all put into this.  I _should_
point out that SETTRANSFER doesn't *have* to return all RRs - though
returning them all is recommended in general - so if you want to *force*
a particular address (in spite of the bloody sortlists out there), simply
return only a single record - but that could have some nasty surprises
also (e.g. failed connections).

Lastly, with the low-TTL SETTRANSFER, there's nothing to stop you from
changing the *content* of the records either (just imagine the fun you can
have with TXT records now!).  So far, I've not found something I _cannot_
do with the SETTRANSFER code and a shell script -- just for kicks I had
BIND paging our sysadmin for a while (though he didn't seem to see the
humor in it).

Yes, there is more overhead associated with the zone transfers - but the
frequency of zone transfers and the computation incurred at each transfer
is 100% under your control.  *You* choose how often and how intensely.
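For concreteness, here is a rough sketch of what such a transfer-agent
script might look like.  Nothing below is taken from the actual DIFFS on
pilot.njin.net: the option letters, the exit-code meanings, the host names
and addresses, and the load probe are all assumptions that would need
adapting to however your patched named actually invokes its zone-transfer
program (the sketch assumes a named-xfer-style invocation, with exit 1
meaning "zone rebuilt, load it").

    #!/bin/sh
    # Hypothetical SETTRANSFER-style zone transfer agent -- a sketch only.
    # Assumed (not taken from the real DIFFS): named calls this roughly
    # the way it calls named-xfer, i.e.
    #     agent -z <zone> -f <file-to-write> -s <old-serial> [servers...]
    # and exit status 1 means "zone rebuilt, please load it", while 0
    # means "nothing changed".

    ZONE= FILE= SERIAL=0
    while getopts z:f:s:d:l:t:p: opt; do
        case "$opt" in
            z) ZONE="$OPTARG" ;;
            f) FILE="$OPTARG" ;;
            s) SERIAL="$OPTARG" ;;
            *) : ;;                   # ignore anything else named passes
        esac
    done

    # One-number load estimate for a host; empty output means it didn't
    # answer.  Swap in rup, an SNMP query, or whatever you actually trust.
    load_of () {
        rsh "$1" uptime 2>/dev/null |
            sed -e 's/.*load average[s]*: *//' -e 's/[, ].*//'
    }

    # Cluster members and their addresses (purely illustrative).
    HOSTS="sys1.example.edu 192.0.2.1
    sys2.example.edu 192.0.2.2
    sys3.example.edu 192.0.2.3"

    RANK=/tmp/settransfer.$$
    : > $RANK
    echo "$HOSTS" | while read host addr; do
        l=`load_of $host`
        echo "${l:-999} $addr" >> $RANK   # unreachable hosts sort last
    done

    # Emit the zone with the least-loaded address first and very low
    # TTLs, bumping the serial so named notices the zone has changed.
    NEWSERIAL=`expr $SERIAL + 1`
    (
        echo "\$ORIGIN $ZONE."
        echo "@ 60 IN SOA ns.$ZONE. hostmaster.$ZONE. ( $NEWSERIAL 60 30 604800 60 )"
        echo "@ 60 IN NS ns.$ZONE."
        sort -n $RANK | while read l addr; do
            echo "appl 60 IN A $addr"
        done
    ) > $FILE
    rm -f $RANK

    exit 1

The TTLs, the effective refresh interval, and the cost of the load probe
all live in the script - which is the point: how often and how intensely
is entirely the administrator's choice.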
Quite frankly, from the observations I've made, the extra CPU expense
incurred by the zone transfers still pales in comparison with the overhead
of BIND in general (that's not a complaint - just an observation).

My next idea was to put PostScript (tm) into TXT RRs and have a little
PostScript interpreter built into BIND.  SRA, however, recommended
Lisp ....  Maybe FORTH would be a better idea ...  Hmmm...  Perl
anyone?  ;->

                                Tp.
...!rutgers!brisco (UUCP)   brisco@pilot.njin.net (Internet)
brisco@ZODIAC (BITNET)      908-445-2351 (VOICE)
                     Just say "Moo"

>>> bind/workers:583

Date: Fri, 08 Apr 1994 13:55:10 EDT
To: Paul A Vixie
From: Tp Brisco
Subject: Re: Appropriate time to ...

Hmm - I probably should've waited until I was in a better mood before I
replied to you ...

> i'd like you to consider my views on all this, even though you've
> clearly got a lot invested in the way you're doing it now.  i do
> not think that doing this in the zone transfer mechanism is at all
> the right way.  reasons against it include:
>
> (1) transfer ordering is NOT deterministic unless all hosts do the
>     same (unspecified) thing with ordering and there are always the
>     same number of hosts in the path from primary-secondary-resolver
>     (consider older BIND versions that used LIFO ordering of cache
>     RR's).  the best thing you can guarantee, without changing the
>     protocol so that the ordering is _specified_, will be the same
>     as round robin: stochastic randomness.

Ah, under the ``TRANSFER'' scheme there are no primaries, just
secondaries.  Each secondary should be doing its own computation, so
primary->secondary reordering isn't really an issue.  There isn't even a
way of defining "dynamic information" in a primary under my scheme.

Resolver code - at least the code that I've looked at in detail - is
generally incredibly stupid (not to mention non-compliant).
Unfortunately, it appears that most people are using fairly ancient
resolver code - and broken code at that.  Empirically I've noticed that
resolver code typically uses just the first RR anyway - most fail - and
only a few will actually walk down the RRs if multiple A's are provided.
None (that I've noticed) make appropriate use of the "additional info"
section of the responses.  The secondary -> resolver relationship is
simple enough that reordering inside of it shouldn't be a concern.

Things like sortlists and such can be problematic, but you get what you
asked for (whether that's what you intended or not).  Even sortlists pose
minimal problems - sortlists typically sort based upon network number -
and most cluster elements exist on the same LAN for pragmatic reasons.
Where networks are sorted based on topology, presumably the hop count is
more important than actual load-sharing anyway.

Anyway, without my editorializing on resolvers, the key point is that
primaries don't exist for "dynamic zones" - rather, a series of one or
more secondaries exist, and each calculates the information independently.
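To make the sortlist point concrete, this is the sort of client-side
configuration that will quietly reorder whatever the name server sends
back (assuming the 4.9.x resolver's resolv.conf syntax; the domain,
server, and network numbers are illustrative):

    # /etc/resolv.conf on a client
    domain      example.edu
    nameserver  192.0.2.53
    # Addresses on the listed networks are moved to the front of every
    # reply, regardless of the order the name server returned them in.
    sortlist    192.0.2.0/255.255.255.0 198.51.100.0

Which is exactly the "you get what you asked for" behavior: fine when
topology matters more than load, fatal to any ordering the server went to
the trouble of producing.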
> (2) when you begin to apply cluster-style balancing based on load
>     average or some other metric, you will quickly find that the
>     host metrics change much more often than you will be prepared
>     to do zone transfers.  are you prepared for 15-second MIN TTLs?
>     15-second refresh?  one minute would almost work right now, but
>     as hosts and networks keep getting faster at 2X every 18 months,
>     with many "sessions" being like WWW (connect, grab, disconnect;
>     repeat at intervals), 15 seconds will still end up being too
>     short.

I think we both need to admit that "balancing" based upon load averages is
a fallacy - at least on systems that aren't prepared to dynamically move
existing sessions between hosts without blowing the connection.  Nothing
that I know of is capable of this today - and if that ever becomes the
case, we're going to have to re-think DNS from the ground up.  The sad
fact is that _any_ load balancing is going to be an approximation - simply
because "load" isn't an easily measured quantity (take a look at OPSTAT
some time!  Sheesh!).

I think I'm prepared for 15-second TTLs - but is BIND?  :-)  If you look
at the DIFFS (anonymous ftp to pilot.njin.net:pub/TRANSFER/*) and the
README, I indicate that I torqued down (apparently successfully) the
hard-limit minTTL of 5 minutes to 1 minute.  For "most" load-balancing
applications this should be sufficient.  The FYI (it isn't an RFC for a
reason ...) talks about this as being "reasonable" (no, it's not
impressive, but it does easily solve the bulk of the problems out there -
with no mods to the protocol).

Undoubtedly the smallest safe granularity for measuring CPU-type loads is
the 1-minute mark; anything less than that starts to be meaningless.
Let's face it, if you take the sample rate down to about 1 * 10^9, you
start finding the load either 100% or 0% (I've run into a lot of these
same problems in measuring networks - OPSTAT has run across a lot of them
also).  I think I can safely claim that statistics just isn't prepared to
deal with an overly small sample in order to extrapolate to such a large
change.

Again, this is engineered to solve 90% of the existing needs today.  The
other 10% are complex enough to warrant "special purpose nameservers" (I
think the FYI states this blatantly).  The truth of the matter is that if
you _really_ want 100% accurate / 0-TTL information, you need a
specialized nameserver for the time being (until we can figure out a
better way of doing this altogether).  My FYI has generated about four
copies (from random people out on the net) who have written "specialized
nameservers" to deal with such problems when they have them.  The big
problem is for the other 90% of the people who need a generalized
mechanism for multitudes of problems where "reasonably accurate" is all
they need.  I believe that this solves those problems in a general way,
with reasonable CPU impact.  I don't claim that it is glossy, impressive,
or otherwise technologically astounding - but pragmatic solutions rarely
are.  (Which is probably why Bill Gates is rich, and I'm not.)
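To put numbers on that "1-minute granularity", this is roughly what the
zone data looks like with the minTTL torqued down (the names, addresses,
and timer values are illustrative, and this assumes the patched named
still drives its transfer agent off the SOA refresh timer):

    @       IN  SOA  ns  hostmaster (
                     1994060900   ; serial - bumped on every rebuild
                     60           ; refresh - recompute about once a minute
                     30           ; retry
                     604800       ; expire
                     60 )         ; minimum TTL - the 1-minute floor
            IN  NS   ns

    ; the balanced records, reordered by the transfer agent each time
    appl    60  IN  A    192.0.2.1
            60  IN  A    192.0.2.2
            60  IN  A    192.0.2.3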
> the "right" way to do this, as i've said all along (though my words on
> this are probably buried so far back in the namedroppers archive that
> most people have never seen them), is to add some kind of weighting, via
> either SNMP or new RR types, that the resolvers can use to make
> ``directed'' ordering choices after they receive a list of addresses.
> probably using a new RR type is best since this information will have to
> be cached lest we start melting wires.  i recognized this as a good
> aspect of your original CIP proposal, though i wasn't completely pleased
> with the overall CIP approach since there was too much that was new
> about it.
>
> you just cannot do clustering on top of host-address information without
> some kind of meta-data directing the clients.  the BIND resolver already
> does address reordering based on network connectivity (preferring close
> addresses to more distant ones) which would blow away any ordering you
> did manage to achieve with the zone transfer ordering approach.

I certainly agree with you here - but maybe for different reasons.  As I
see it, the clients will ultimately want to order information based upon
their own criteria - so we should (at least) be shipping around vectors of
information instead of points of information, and then allow the client
side to do all the computation it wishes in order to select the
appropriate record.  I believe my next stab is going to be to put a
PostScript interpreter inside of DNS and have TXT records shipped around
with little programs inside of them in order to ascertain the appropriate
RR to return - this 100% dynamic, client-derived information is going to
be the only ultimate solution.  This information could be cached, and
re-executed as necessary to derive new information.

As far as I can _really_ tell, what was wrong with the CIP proposal is
that (from the DNS WG / IETF standpoint) it specified "different" RR
handling - and was therefore a change to the DNS specification, and was
therefore akin to helping the USSR get back together.  (Look, _I_ don't
consider the DNS spec to be something holy - the Internet/ARPAnet _used_
to be set up so that things evolved.)  See my notes above re: client-side
reordering.

> i know you've spent a lot of time shepherding this through the IETF, and
> i'm sure you're ready to shoot me for suggesting a return to first
> principles.  but while you've been doing what you've been doing, i've
> been considering the implications of dynamic address assignment
> (distributed database updates sent from terminal servers back to name
> servers), mobile hosts (ip address changes as you pass through cell
> boundaries; a new "LOC" (location) RR that's updated based on GPS data
> in the client); multi-homed hosts (not well solved yet); variable width
> subnetting on non-octet boundaries (in-addr.arpa isn't enough, and we're
> going to have to be able to express subnet boundaries and maps in the
> DNS itself if OSPF and CIDR are really going to save us from running out
> of IP address space).
>
> in other words, i'm viewing the clustering issue as part of a much wider
> problem set and i would like to find a common architectural principle
> that will make all of these problems easier to solve.
>
> doing it via zone transfers is not the answer i was hoping for.
>
> sorry to be a pill about this.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<- Not a problem.  Let's just make sure to
leave most of the fighting here.  I've seen your work for years, and
appreciate 99.9% of it greatly.  While I'm willing to argue vehemently
over this, I can't say the same for most of the other things I'm involved
in :-).

Re: the shepherding through the IETF - I really can't get too hot about
it; frankly, much of what you are saying is what I was saying when I first
walked in.  To my own self, I've backed off on a number of design
principles in order to get _some_ standard set.  What I do like about this
approach is the near-insane level of flexibility it provides, and frankly
it _is_ a bit of a "clever hack" (though I cannot claim to be the sole
source of the idea).
I suspect that the kernel of the idea originated with Jon Postel, but so
much "committee" was going on that it is difficult to trace back any
single portion of it (which is why the FYI discusses so much of the
history - both to give credit, and to avoid having someone else suffer
through that much committee work again).  I suppose the worrisome thing
about the IETF is that after you listen to it enough, it starts to make
sense.

Look carefully at the problems you mention above:

        1) load balancing
        2) multi-homed hosts
        3) dynamic addressing
        4) mobile hosts
        5) GPS "LOC" information
        6) subnet masks
        7) OSPF/CIDR

1&2 are essentially the same "class" of problem, as are 3&4.  5&6 are yet
another.  7 is, of course, a problem unto itself :-).  1&2 are
*descriptive* problems - where we describe a given situation.  3&4 are
*prescriptive* problems - where only the hosts can really feed us the
information.  5&6 (in a perfect world) could be fixed by relatively
flexible TXT-like records.  5 could conceivably be considered to be of the
same class as 3&4.  We can't really solve 7 until we're sure that it is
the correct problem.

The FYI doesn't attempt to solve all of these.  In fact, I would guess
that you'd get a considerable amount of friction if you tried to solve (2)
inside of the DNS server.  In some sense, 3/4/5 hinge on the ability of
hosts (or some trusted third party) to update the server in a believable
way (I can just imagine the fun if we tried to put GPS LOC info in the
existing records - we could make people appear all over the world :-).
3/4/5 are undoubtedly best addressed by the "dynamic update" people (and 4
only if we agree to treat mobile hosts along the lines that cellular
phones are handled).

The upshot of what I learned is that (1) is not an atomic problem.  The
closer I looked, the more problems I found.  Statistical randomization (a
la CIP, RoundRobin, SA) is fine - if what you are trying to model can be
done statistically.  Then I ran into one fine young gentleman who wanted A
RRs ordered according to the RTT of packets - sounds insane, doesn't it?
Until I realized that I really wanted that also ...  It would be nice if
Rutgers University had *one* nameserver (instead of the 12 it has now) and
the topologically closest address "happened" to be the one presented
first.  That way, I could tell all of my users the same thing - "Use this
name" - instead of the current "if you're over there, use that name; but
if you go up to Newark, use this name instead".  Ugh!  And "normal users"
(about whom we could share many a long, drunken, rip-roaring laugh) really
shouldn't be subject to topological optimization information - they're
just not up to it (Look, I do have _some_ sympathy for them :-).  And yes,
I had the "MobileIP" folks jumping on me to provide them with a solution
via this mechanism also, but I saw that it really wasn't appropriate (and
yes, I was there when we turned down their new RR request).

Yes, I realize that this does not solve all of the problems - but it
_does_ solve 1/7th of them, which is a helluva lot more than we've gotten
out of the DNS WG in the last five years...  And no, this isn't the
ultimate solution, but I do believe it is as close as we're going to be
able to get with the current DNS mechanisms in place.  DNS simply wasn't
designed to handle extremely dynamic information.

Grab a copy of the draft FYI and read it over.  You may be pleasantly
surprised by how humble it really is.  It's not exactly rocket science,
but ...

                                Tp.
...!rutgers!brisco (UUCP)   brisco@pilot.njin.net (Internet)
brisco@ZODIAC (BITNET)      908-932-2351 (VOICE)
                            908-445-2351 as of 5/27/94 PM
T.P. Brisco, Associate Director for Network Operations,
Computing Services, Telecommunications Division
Rutgers University, Piscataway, NJ 08855-0879
                     Just say "Moo"