A cached approach to mirroring
September 5th, 2005
For some reason I dislike rsync. I could go into why, but that is not the point; I only bring it up because I have been pondering alternative ways to mirror software. Initially a few of us tossed around the idea of using RSS feeds to mirror data. However, with the Mozilla mirror alone at 36,000+ files, that seemed a bit too ambitious. So here is the latest idea: treat each of the second-tier mirrors as a simple cache of the master mirror.
A play-by-play: The user requests a file from mirror C, which does not have it. Before sending a “File not found”, mirror C asks the master staging mirror whether it has the file, and it does! Mirror C then happily grabs the bits, passing them on to the user as it gets them. The mirror keeps a copy of the data for itself so it does not have to bother the master the next time a client comes around. Mirror C is now a cache for the master, and follows all rules governing contemporary caching.
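To make the play-by-play concrete, here is a minimal Python sketch of the "mirror as cache" step. Everything in it is an assumption for illustration: the class name `MirrorCache`, the injected `fetch_from_master` callable, and the choice to store files under a hash of the request path. It also buffers the whole file rather than streaming it to the user as it arrives, and it ignores expiry rules entirely.

```python
import hashlib
import os
from typing import Callable, Optional


class MirrorCache:
    """Hypothetical second-tier mirror that treats the master as its origin."""

    def __init__(self, cache_dir: str,
                 fetch_from_master: Callable[[str], Optional[bytes]]):
        self.cache_dir = cache_dir
        # Assumed callable: returns the file's bytes, or None if the master lacks it.
        self.fetch_from_master = fetch_from_master

    def _local_path(self, name: str) -> str:
        # Store under a hash of the request path so a request can never
        # escape cache_dir.
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def get(self, name: str) -> Optional[bytes]:
        path = self._local_path(name)
        if os.path.exists(path):
            # Cache hit: serve the local copy, the master is not bothered.
            with open(path, "rb") as f:
                return f.read()
        # Cache miss: ask the master before giving up.
        data = self.fetch_from_master(name)
        if data is None:
            return None  # only now is it a real "File not found"
        # Keep a copy for the next client.
        with open(path, "wb") as f:
            f.write(data)
        return data
```

In a real deployment `fetch_from_master` would be an HTTP request to the staging server; injecting it keeps the sketch independent of any particular transport.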
Wait, this just convoluted things. However, if you combine the model with a few other tools, you create a grand unified mirroring toolkit (gumt, not guft :). In particular, let's throw Bouncer and Sentry into the mix.
As you probably already know, Bouncer is the tool that powers the “download now” link on www.mozilla.org. Bouncer is a database-backed request-to-random-mirror distributor, for joy. Sentry is a script that checks each mirror's integrity and adds it to or removes it from Bouncer's database.
The new way (using Firefox 1.5 as an example):
- 0: the release lands on the staging server
- 0.02 seconds later: Sentry sends off requests provoking each of the mirrors to grab the freshest of data
- 7.23 or so minutes: Sentry finishes its checks and updates the bouncer database.
- 7.24 minutes: The favorite news site picks up the link, and all our happy users download the latest release at ultra snappy speeds.
Well, look what we have here! After roughly 10 minutes (in theory) we have gotten our bits out to the mirrors, made sure the data is legit, AND have records of where each user got their release (for the most part).
Technically, this is all achievable. Bouncer and Sentry are in use right now, with 2.0 coming out soon. The only non-trivial part is turning mirrors into pseudo-caches. It could probably be done with a mod_cache/proxy combo… maybe another mod_foo is in order for Apache.
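The timeline above could be sketched roughly as follows, again in Python. This is not how Bouncer or Sentry are actually implemented; `sentry_check`, `bounce`, the `fetch` callable, and the use of SHA-256 as the integrity check are all assumptions made for the example. The point is only that a single request per mirror both provokes the cache fill and verifies the result.

```python
import hashlib
import random
from typing import Callable, List, Optional, Tuple

# Assumed transport: (mirror, path) -> file bytes, or None on failure.
Fetch = Callable[[str, str], Optional[bytes]]


def sentry_check(mirrors: List[str], path: str,
                 expected_sha256: str, fetch: Fetch) -> List[str]:
    """Request `path` from every mirror; return those serving the right bits."""
    healthy = []
    for mirror in mirrors:
        # On a cache-style mirror this request also makes it pull the file.
        data = fetch(mirror, path)
        if data is not None and hashlib.sha256(data).hexdigest() == expected_sha256:
            healthy.append(mirror)
    return healthy


def bounce(healthy: List[str], path: str,
           log: List[Tuple[str, str]]) -> str:
    """Pick a random healthy mirror and record where the user was sent."""
    if not healthy:
        raise LookupError("no healthy mirror has " + path)
    mirror = random.choice(healthy)
    log.append((path, mirror))
    return mirror + path
```

Here the list returned by `sentry_check` stands in for the Bouncer database, and `log` stands in for the download records.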
I meant this to be an RFC. Thoughts? Is this loony? Or is it already being done somewhere?
Oh yeah, forget about true FTP users.

September 6th, 2005 at 12:36 am
FTP users and clients are important, so you have to deal with them. Why not also treat directory listings as cacheable, with shorter expiration times of course?
September 6th, 2005 at 3:31 am
Yes, it makes a lot of sense.
But then, mirrors usually serve all or most of their files. IOW: They’ll have to download it from the main server anyway, so why not just download everything like now?
And remember that “mirror” also means that a mirror should be able to serve all files that the main server has in case of failure of the main server.
What we really need is a bittorrent-like protocol adapted to replace the FTP crap. The kernel.org people wanted something like that IIRC. Just put seeds in the main mirror, let everyone share their bandwidth… scales much better than mirrors.
September 6th, 2005 at 5:44 am
This sounds very similar to the Coral Cache (http://www.coralcdn.org/) except that you are proposing to proactively populate the mirrors instead of populating reactively.
- Chris
September 6th, 2005 at 9:36 am
Several years ago we had thought about doing this with the AFS file-system on several mirrors. The idea is that we’d have AFS running on several mirrors and leveraging its caching abilities would allow us to accomplish exactly what you were describing. The downer was that everybody would have to custom install AFS (not default by any means).
I think this is a great idea if you can leverage “commodity” software. Things like apache, PHP or squid to make this happen. Dunno.
September 7th, 2005 at 3:17 am
How would this handle removal of files? We do have a few files staged that actually shouldn’t be. (Yes, there’s a bug on it.)
How would it handle updating in the latest-foo dirs? It shouldn’t really check the main server for each one, right?
And I like ftp :-).
September 7th, 2005 at 4:00 pm
So before we had RSS and bittorrent there was konspire, a combination of both ideas. Rather than polling an RSS feed for updates, the master tells everybody who is subscribed (the mirrors) when it changes. What follows is p2p distribution of the updates throughout the rest of the mirrors. I wonder if using konspire to keep mirrors up to date would actually work or be a good idea. konspire always seemed to be an awesome idea that never caught on.
September 19th, 2005 at 5:35 pm
Just a simpleton network integrator/administrator.
Rsync - if you don’t like it, why not fix it?
Seems to me that Rsync is much more powerful than a simple cache.
I guess using RSS would be a fun hack, but still a hack, right?
Just trying to understand why you don’t want to fix one of the “software mirroring” projects already in the community and start a completely new one - or do you just want something proprietary to work on?
confused,
Brandon Fouts.
June 5th, 2006 at 9:49 am
Have you just reinvented round-robin DNS?