>>> bind/workers:582

Date: 09 Jun 1994 16:26:21 PDT
From: brisco@hercules.rutgers.edu (Tp Brisco)
Subject: Re: Using DNS to Load Balance

> > The upshot of the changes is that BIND can run a specific
> > program to do the zone transfer.  The program should - of course -
> > make various appropriate computations, and reorder the RRs as
> > you see fit, then return an appropriate exit code to indicate
> > the relative success of the zone transfer --- voila!  You can
> > have the RRs reordered with as much frequency as you dare turn
> > down the TTLs (I've gone as low as 5 minutes with no apparent
> > ill effects).
>
> Maybe I'm missing something here, but why not use the ROUND_ROBIN feature
> to give one of a list of addresses?  It works well for load balancing
> here and doesn't require zone transfers.  e.g.
>
>    appl    IN  CNAME  sys1
>            IN  CNAME  sys2
>    sys1    IN  A      127.0.0.1
>    sys2    IN  A      127.0.0.2
>            IN  A      127.0.0.3
>
> results in sys1 getting half the load and each address on sys2 getting
> a quarter.
>
> I don't know if ROUND_ROBIN is available in all implementations.  I found
> it in named 4.9.2.
>
> Simon Hamilton

The problem that _we_ ran into was that ROUND_ROBIN works on everything
with multiple similar record types -- which isn't necessarily what we
wanted.  In particular, I've seen other people "gripe" about the NS
records getting shuffled.  Additionally, at Rutgers, we depend (*cringe* -
don't flame) on the ordering: we've got some "cluster" machines that have
"private" networks for intra-cluster communications, and we'd prefer those
A RRs not be mucked with.  (BTW: we do advertise the less preferential
addresses, since we'd prefer that those be used - but only if the more
preferential addresses are dead for some reason.)

Also, the ROUND_ROBIN approach assumes that the relative distribution of
"load" doesn't change - i.e. until that big SAS user logs onto "sys2" and
throws your statistical model out of whack.  The ``SETTRANSFER'' (my
compilation conditional) can _react_ to actual load changes - if you so
wish.  Don't get me wrong - if ROUND_ROBIN works for you, use it in good
health.

[ Anyone know what happened to the "SHUFFLE_A" code?  Was it superseded
  by ROUND_ROBIN? ]

ROUND_ROBIN does a fine statistical randomization (in fact, one of the
early proposals I put forward was a _weighted_ statistical randomization
technique).  ROUND_ROBIN did, however, have a couple of unpleasant
surprises for us.

One of the drawbacks of *ALL* RR _ordering_ mechanisms is that the *&*^%^
"sortlist" qualifier (I think it's in the resolver - but may be in both)
can really undo all the hard work we've all put into this.  I _should_
point out that SETTRANSFER doesn't *have* to return all RRs - though
returning them all is recommended in general - so if you want to *force*
a particular address (in spite of the bloody sortlists out there), simply
return only a single record - but that could have some nasty surprises
also (e.g. failed connections).

Lastly, with the low-TTL SETTRANSFER, there's nothing to stop you from
changing the *content* of the records either (just imagine the fun you can
have with TXT records now!).  So far, I've not found something I _cannot_
do with the SETTRANSFER code and a shell script -- just for kicks I had
BIND paging our sysadmin for a while (though he didn't seem to see the
humor in it).

Yes, there is more overhead associated with the zone transfers - but the
frequency of zone transfers and the computation incurred at each transfer
is 100% under your control.  *You* choose how often and how intensely.
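For concreteness, here is a rough sketch of what such a transfer-agent
script might look like.  Nothing below is taken from the actual DIFFS on
pilot.njin.net: the option letters, the exit-code meanings, the host names
and addresses, and the load probe are all assumptions that would need
adapting to however your patched named actually invokes its zone-transfer
program (the sketch assumes a named-xfer-style invocation, with exit 1
meaning "zone rebuilt, load it").

    #!/bin/sh
    # Hypothetical SETTRANSFER-style zone transfer agent -- a sketch only.
    # Assumed (not taken from the real DIFFS): named calls this roughly
    # the way it calls named-xfer, i.e.
    #     agent -z <zone> -f <file-to-write> -s <old-serial> [servers...]
    # and exit status 1 means "zone rebuilt, please load it", while 0
    # means "nothing changed".

    ZONE= FILE= SERIAL=0
    while getopts z:f:s:d:l:t:p: opt; do
        case "$opt" in
            z) ZONE="$OPTARG" ;;
            f) FILE="$OPTARG" ;;
            s) SERIAL="$OPTARG" ;;
            *) : ;;                   # ignore anything else named passes
        esac
    done

    # One-number load estimate for a host; empty output means it didn't
    # answer.  Swap in rup, an SNMP query, or whatever you actually trust.
    load_of () {
        rsh "$1" uptime 2>/dev/null |
            sed -e 's/.*load average[s]*: *//' -e 's/[, ].*//'
    }

    # Cluster members and their addresses (purely illustrative).
    HOSTS="sys1.example.edu 192.0.2.1
    sys2.example.edu 192.0.2.2
    sys3.example.edu 192.0.2.3"

    RANK=/tmp/settransfer.$$
    : > $RANK
    echo "$HOSTS" | while read host addr; do
        l=`load_of $host`
        echo "${l:-999} $addr" >> $RANK   # unreachable hosts sort last
    done

    # Emit the zone with the least-loaded address first and very low
    # TTLs, bumping the serial so named notices the zone has changed.
    NEWSERIAL=`expr $SERIAL + 1`
    (
        echo "\$ORIGIN $ZONE."
        echo "@ 60 IN SOA ns.$ZONE. hostmaster.$ZONE. ( $NEWSERIAL 60 30 604800 60 )"
        echo "@ 60 IN NS ns.$ZONE."
        sort -n $RANK | while read l addr; do
            echo "appl 60 IN A $addr"
        done
    ) > $FILE
    rm -f $RANK

    exit 1

The TTLs, the effective refresh interval, and the cost of the load probe
all live in the script - which is the point: how often and how intensely
is entirely the administrator's choice.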
Quite frankly, from the observations I've made, the extra CPU expense
incurred by the zone transfers still pales in comparison with the overhead
of BIND in general (that's not a complaint - just an observation).

My next idea was to put PostScript (tm) into TXT RRs and have a little
PostScript interpreter built into BIND.  SRA, however, recommended
Lisp ....  Maybe FORTH would be a better idea ...  Hmmm...  Perl
anyone?  ;->

                                Tp.
...!rutgers!brisco (UUCP)   brisco@pilot.njin.net (Internet)
brisco@ZODIAC (BITNET)      908-445-2351 (VOICE)
                     Just say "Moo"

>>> bind/workers:583

Date: Fri, 08 Apr 1994 13:55:10 EDT
To: Paul A Vixie
From: Tp Brisco
Subject: Re: Appropriate time to ...

Hmm - I probably should've waited until I was in a better mood before I
replied to you ...

> i'd like you to consider my views on all this, even though you've
> clearly got a lot invested in the way you're doing it now.  i do
> not think that doing this in the zone transfer mechanism is at all
> the right way.  reasons against it include:
>
> (1) transfer ordering is NOT deterministic unless all hosts do the
>     same (unspecified) thing with ordering and there are always the
>     same number of hosts in the path from primary-secondary-resolver
>     (consider older BIND versions that used LIFO ordering of cache
>     RR's).  the best thing you can guarantee, without changing the
>     protocol so that the ordering is _specified_, will be the same
>     as round robin: stochastic randomness.

Ah, under the ``TRANSFER'' scheme there are no primaries, just
secondaries.  Each secondary should be doing its own computation, so
primary->secondary reordering isn't really an issue.  There isn't even a
way of defining "dynamic information" in a primary under my scheme.

Resolver code - at least the code that I've looked at in detail - is
generally incredibly stupid (not to mention non-compliant).
Unfortunately, it appears that most people are using fairly ancient
resolver code - and broken code at that.  Empirically I've noticed that
resolver code typically uses just the first RR anyway - most fail - and
only a few will actually walk down the RRs if multiple A's are provided.
None (that I've noticed) make appropriate use of the "additional info"
section of the responses.  The secondary -> resolver relationship is
simple enough that reordering inside of it shouldn't be a concern.

Things like sortlists and such can be problematic, but you get what you
asked for (whether that's what you intended or not).  Even sortlists pose
minimal problems - sortlists typically sort based upon network number -
and most cluster elements exist on the same LAN for pragmatic reasons.
Where networks are sorted based on topology, presumably the hop count is
more important than actual load-sharing anyway.

Anyway, without my editorializing on resolvers, the key point is that
primaries don't exist for "dynamic zones" - rather, a series of one or
more secondaries exist, and each calculates the information independently.
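To make the sortlist point concrete, this is the sort of client-side
configuration that will quietly reorder whatever the name server sends
back (assuming the 4.9.x resolver's resolv.conf syntax; the domain,
server, and network numbers are illustrative):

    # /etc/resolv.conf on a client
    domain      example.edu
    nameserver  192.0.2.53
    # Addresses on the listed networks are moved to the front of every
    # reply, regardless of the order the name server returned them in.
    sortlist    192.0.2.0/255.255.255.0 198.51.100.0

Which is exactly the "you get what you asked for" behavior: fine when
topology matters more than load, fatal to any ordering the server went to
the trouble of producing.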
> (2) when you begin to apply cluster-style balancing based on load
>     average or some other metric, you will quickly find that the
>     host metrics change much more often than you will be prepared
>     to do zone transfers.  are you prepared for 15-second MIN TTLs?
>     15-second refresh?  one minute would almost work right now, but
>     as hosts and networks keep getting faster at 2X every 18 months,
>     with many "sessions" being like WWW (connect, grab, disconnect;
>     repeat at intervals), 15 seconds will still end up being too
>     short.

I think we both need to admit that "balancing" based upon load averages is
a fallacy - at least on systems that aren't prepared to dynamically move
existing sessions between hosts without blowing the connection.  Nothing
that I know of is capable of this today - and if that ever becomes the
case, we're going to have to re-think DNS from the ground up.  The sad
fact is that _any_ load balancing is going to be an approximation - simply
because "load" isn't an easily measured quantity (take a look at OPSTAT
some time!  Sheesh!).

I think I'm prepared for 15-second TTLs - but is BIND?  :-)  If you look
at the DIFFS (anonymous ftp to pilot.njin.net:pub/TRANSFER/*) and the
README, I indicate that I torqued down (apparently successfully) the
hard-limit minTTL of 5 minutes to 1 minute.  For "most" load-balancing
applications this should be sufficient.  The FYI (it isn't an RFC for a
reason ...) talks about this as being "reasonable" (no, it's not
impressive, but it does easily solve the bulk of the problems out there -
with no mods to the protocol).

Undoubtedly the smallest safe granularity for measuring CPU-type loads is
the 1-minute mark; anything less than that starts to be meaningless.
Let's face it, if you take the sample rate down to about 1 * 10^9, you
start finding the load either 100% or 0% (I've run into a lot of these
same problems in measuring networks - OPSTAT has run across a lot of them
also).  I think I can safely claim that statistics just isn't prepared to
deal with an overly small sample in order to extrapolate to such a large
change.

Again, this is engineered to solve 90% of the existing needs today.  The
other 10% are complex enough to warrant "special purpose nameservers" (I
think the FYI states this blatantly).  The truth of the matter is that if
you _really_ want 100% accurate / 0-TTL information, you need a
specialized nameserver for the time being (until we can figure out a
better way of doing this altogether).  My FYI has generated about four
copies (from random people out on the net) who have written "specialized
nameservers" to deal with such problems when they have them.  The big
problem is for the other 90% of the people who need a generalized
mechanism for multitudes of problems where "reasonably accurate" is all
they need.  I believe that this solves those problems in a general way,
with reasonable CPU impact.  I don't claim that it is glossy, impressive,
or otherwise technologically astounding - but pragmatic solutions rarely
are.  (Which is probably why Bill Gates is rich, and I'm not.)
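To put numbers on that "1-minute granularity", this is roughly what the
zone data looks like with the minTTL torqued down (the names, addresses,
and timer values are illustrative, and this assumes the patched named
still drives its transfer agent off the SOA refresh timer):

    @       IN  SOA  ns  hostmaster (
                     1994060900   ; serial - bumped on every rebuild
                     60           ; refresh - recompute about once a minute
                     30           ; retry
                     604800       ; expire
                     60 )         ; minimum TTL - the 1-minute floor
            IN  NS   ns

    ; the balanced records, reordered by the transfer agent each time
    appl    60  IN  A    192.0.2.1
            60  IN  A    192.0.2.2
            60  IN  A    192.0.2.3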
> the "right" way to do this, as i've said all along (though my words on
> this are probably buried so far back in the namedroppers archive that
> most people have never seen them), is to add some kind of weighting, via
> either SNMP or new RR types, that the resolvers can use to make
> ``directed'' ordering choices after they receive a list of addresses.
> probably using a new RR type is best since this information will have to
> be cached lest we start melting wires.  i recognized this as a good
> aspect of your original CIP proposal, though i wasn't completely pleased
> with the overall CIP approach since there was too much that was new
> about it.
>
> you just cannot do clustering on top of host-address information without
> some kind of meta-data directing the clients.  the BIND resolver already
> does address reordering based on network connectivity (preferring close
> addresses to more distant ones) which would blow away any ordering you
> did manage to achieve with the zone transfer ordering approach.

I certainly agree with you here - but maybe for different reasons.  As I
see it, the clients will ultimately want to order information based upon
their own criteria - so we should (at least) be shipping around vectors of
information instead of points of information, and then allow the client
side to do all the computation it wishes in order to select the
appropriate record.  I believe my next stab is going to be to put a
PostScript interpreter inside of DNS and have TXT records shipped around
with little programs inside of them in order to ascertain the appropriate
RR to return - this 100% dynamic, client-derived information is going to
be the only ultimate solution.  This information could be cached, and
re-executed as necessary to derive new information.

As far as I can _really_ tell, what was wrong with the CIP proposal is
that (from the DNS WG / IETF standpoint) it specified "different" RR
handling - and was therefore a change to the DNS specification, and was
therefore akin to helping the USSR get back together.  (Look, _I_ don't
consider the DNS spec to be something holy - the Internet/ARPAnet _used_
to be set up so that things evolved.)  See my notes above re: client-side
reordering.

> i know you've spent a lot of time shepherding this through the IETF, and
> i'm sure you're ready to shoot me for suggesting a return to first
> principles.  but while you've been doing what you've been doing, i've
> been considering the implications of dynamic address assignment
> (distributed database updates sent from terminal servers back to name
> servers), mobile hosts (ip address changes as you pass through cell
> boundaries; a new "LOC" (location) RR that's updated based on GPS data
> in the client); multi-homed hosts (not well solved yet); variable width
> subnetting on non-octet boundaries (in-addr.arpa isn't enough, and we're
> going to have to be able to express subnet boundaries and maps in the
> DNS itself if OSPF and CIDR are really going to save us from running out
> of IP address space).
>
> in other words, i'm viewing the clustering issue as part of a much wider
> problem set and i would like to find a common architectural principle
> that will make all of these problems easier to solve.
>
> doing it via zone transfers is not the answer i was hoping for.
>
> sorry to be a pill about this.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<- Not a problem.  Let's just make sure to
leave most of the fighting here.  I've seen your work for years, and
appreciate 99.9% of it greatly.  While I'm willing to argue vehemently
over this, I can't say the same for most of the other things I'm involved
in :-).

Re: the shepherding through the IETF - I really can't get too hot about
it; frankly, much of what you are saying is what I was saying when I first
walked in.  To my own self, I've backed off on a number of design
principles in order to get _some_ standard set.  What I do like about this
approach is the near-insane level of flexibility it provides, and frankly
it _is_ a bit of a "clever hack" (though I cannot claim to be the sole
source of the idea).
I suspect that the kernel of the idea originated with Jon Postel, but so
much "committee" was going on that it is difficult to trace back any
single portion of it (which is why the FYI discusses so much of the
history - both to give credit, and to avoid having someone else suffer
through that much committee work again).  I suppose the worrisome thing
about the IETF is that after you listen to it enough, it starts to make
sense.

Look carefully at the problems you mention above:

        1) load balancing
        2) multi-homed hosts
        3) dynamic addressing
        4) mobile hosts
        5) GPS "LOC" information
        6) subnet masks
        7) OSPF/CIDR

1&2 are essentially the same "class" of problem, as are 3&4.  5&6 are yet
another.  7 is, of course, a problem unto itself :-).  1&2 are
*descriptive* problems - where we describe a given situation.  3&4 are
*prescriptive* problems - where only the hosts can really feed us the
information.  5&6 (in a perfect world) could be fixed by relatively
flexible TXT-like records.  5 could conceivably be considered to be of the
same class as 3&4.  We can't really solve 7 until we're sure that it is
the correct problem.

The FYI doesn't attempt to solve all of these.  In fact, I would guess
that you'd get a considerable amount of friction if you tried to solve (2)
inside of the DNS server.  In some sense, 3/4/5 hinge on the ability of
hosts (or some trusted third party) to update the server in a believable
way (I can just imagine the fun if we tried to put GPS LOC info in the
existing records - we could make people appear all over the world :-).
3/4/5 are undoubtedly best addressed by the "dynamic update" people (and 4
only if we agree to treat mobile hosts along the lines that cellular
phones are handled).

The upshot of what I learned is that (1) is not an atomic problem.  The
closer I looked, the more problems I found.  Statistical randomization (a
la CIP, RoundRobin, SA) is fine - if what you are trying to model can be
done statistically.  Then I ran into one fine young gentleman who wanted A
RRs ordered according to the RTT of packets - sounds insane, doesn't it?
Until I realized that I really wanted that also ...  It would be nice if
Rutgers University had *one* nameserver (instead of the 12 it has now) and
the topologically closest address "happened" to be the one presented
first.  That way, I could tell all of my users the same thing - "Use this
name" - instead of the current "if you're over there, use that name; but
if you go up to Newark, use this name instead".  Ugh!  And "normal users"
(about whom we could share many a long, drunken, rip-roaring laugh) really
shouldn't be subject to topological optimization information - they're
just not up to it (Look, I do have _some_ sympathy for them :-).  And yes,
I had the "MobileIP" folks jumping on me to provide them with a solution
via this mechanism also, but I saw that it really wasn't appropriate (and
yes, I was there when we turned down their new RR request).

Yes, I realize that this does not solve all of the problems - but it
_does_ solve 1/7th of them, which is a helluva lot more than we've gotten
out of the DNS WG in the last five years...  And no, this isn't the
ultimate solution, but I do believe it is as close as we're going to be
able to get with the current DNS mechanisms in place.  DNS simply wasn't
designed to handle extremely dynamic information.

Grab a copy of the draft FYI and read it over.  You may be pleasantly
surprised by how humble it really is.  It's not exactly rocket science,
but ...

                                Tp.
...!rutgers!brisco (UUCP)   brisco@pilot.njin.net (Internet)
brisco@ZODIAC (BITNET)      908-932-2351 (VOICE)
                            908-445-2351 as of 5/27/94 PM
T.P. Brisco, Associate Director for Network Operations,
Computing Services, Telecommunications Division
Rutgers University, Piscataway, NJ 08855-0879
                     Just say "Moo"