Hi all!
InstantMirror currently works fine, but it involves a lot of configuration, and that might actually stop someone from using it. I came across this project recently and found it really interesting. After reading the discussions on fedora-devel, Warren's journal and the InstantMirror wiki, I thought I would come up with a redesign that is more yum-friendly and involves minimal configuration to set up. Below is my view of the new InstantMirror design.
Note: You can also access the design proposal here: http://iyum.saini.co.in/index.php/InstantMirror
Redesign Proposal
*****************

Basically, the most obvious use case of InstantMirror is to be used by Yum for updates, so I propose to develop InstantMirror keeping in mind that it should integrate well with Yum. In the new design I am not using multiple programs to get the job done, as proposed by Warren here (https://fedorahosted.org/InstantMirror/wiki/InstantMirrorDaemon). There will be only one daemon that continuously listens for requests and, if necessary, forks itself. Below is the method of operation for the daemon.
Method of Operation
*******************

1. InstantMirror gets a client request for a URL.
2. Check: if the URL is not an RPM or a metadata file
   * Then it's none of our business.
   * Let the proxy handle it the normal way.
   * Done and exit.
3. Error check: if the remote host is not reachable
   * Check: if the RPM/metadata is available in cache
     1. Stream the RPM/metadata from cache.
     2. Done and exit.
   * else
     1. Throw a "No route to host" error.
     2. Done and exit.
4. Check: if the RPM/metadata is available in cache
   * Check: if the RPM/metadata in cache is older than upstream
     1. Delete the RPM/metadata from cache.
     2. Download and stream.
     3. Done and exit.
   * Check: if the RPM/metadata matches upstream or is newer than upstream
     1. Stream the RPM/metadata from cache.
     2. Done and exit.
   * Check: if the RPM/metadata does not exist upstream
     1. Delete the RPM/metadata from cache.
     2. Throw a "Not found" error.
     3. Done and exit.
5. Check: if the RPM/metadata is not available in cache
   * Download and stream.
   * Done and exit.

A rough sketch of this decision flow follows.
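To make the flow above concrete, here is a minimal Python sketch. Everything in it (the is_interesting() test, the helper callables passed in, and the size comparison used to detect staleness) is an illustrative assumption, not existing InstantMirror code.

import os

def is_interesting(url):
    # Only RPMs and repository metadata concern InstantMirror (step 2).
    return url.endswith('.rpm') or '/repodata/' in url

def handle(url, cache_file, head_upstream, stream, download):
    """Decide how to serve one request.
    head_upstream(url) returns ('ok', size) or ('missing', None) and raises
    IOError when the remote host is unreachable; stream(path) sends a cached
    file to the client; download(url, path) fetches from upstream, caches
    the file and streams it to the client."""
    if not is_interesting(url):
        return 'pass-through'                  # let the proxy handle it

    cached = os.path.exists(cache_file)

    try:
        status, size = head_upstream(url)
    except IOError:                            # step 3: upstream unreachable
        if cached:
            stream(cache_file)
            return 'served-from-cache'
        raise                                  # becomes "No route to host"

    if status == 'missing':                    # step 4: gone upstream
        if cached:
            os.unlink(cache_file)
        return 'not-found'

    if cached and os.path.getsize(cache_file) == size:
        stream(cache_file)                     # step 4: cached copy still good
        return 'served-from-cache'

    if cached:
        os.unlink(cache_file)                  # step 4: stale copy
    download(url, cache_file)                  # step 5 (and the cache-miss case)
    return 'downloaded'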
Download Process
****************

In the above operation everything is clear except the download process. If a file is already being downloaded from upstream and another client request comes in for the same file, then we have two options to continue downloading:

1. Download via only the first instance (master) and let the other instances (slaves) copy the partial content to the client. The disadvantage is that the slaves will be throttled by the master's download speed.
2. The other instance also starts downloading from upstream and appends data to the local file. Stream the data to clients when the download is finished. This is quite complicated, and as of now I don't even know how to do it, or even if it's feasible.

A sketch of option 1 follows.
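A minimal sketch of option 1, assuming the master writes the file in place and drops a ".done" marker file next to it when the download completes (both conventions are assumptions, not existing behaviour):

import os
import time

def stream_partial(path, write_to_client, chunk=64 * 1024, poll=0.2):
    """Slave side of option 1: stream `path` to a client while the master
    instance may still be appending to it."""
    done_marker = path + '.done'
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk)
            if data:
                write_to_client(data)
                continue
            if os.path.exists(done_marker):
                break                  # master finished, this EOF is real
            time.sleep(poll)           # wait for the master to append more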
I am currently trying to get the hang of the download process.
The above design is more or less the same as the previous design, with maybe a few improvements. Now, we have decided to have two types of InstantMirror.
1. InstantMirror to be used by a small group of people. In this case we have to get rid of dependencies like squid and apache, because for a small setup nobody is going to configure squid and apache. So for this kind of setup we use a proxy server implemented in Python and integrate InstantMirror with it in caching mode, so that it becomes easy to set up and doesn't require squid, apache or anything else.

2. InstantMirror to be used by an organization. As almost all organizations (I am focusing more on institutes/universities here) use a common proxy server to access the Internet, we will have an InstantMirror that can be integrated with squid. There will be no difficulty in setup because people already use squid (assuming squid is widely used in the Unix/Linux world) and know how to configure it. We can't use a proxy server implemented in Python here, because no organization would ever agree to use a stripped-down proxy server instead of squid.
If the above sounds interesting, then I also propose to build a Yum plugin (say, Yum Client) which will interact with the InstantMirror sitting near the proxy server. Yum Client will periodically update InstantMirror with information such as which packages users update very frequently (kernel, kde, yum) and which packages are never or rarely updated (vim). Using this information, and some very simple techniques for queuing downloads, InstantMirror can optimize the download queue and the bandwidth usage. A rough sketch of such a plugin follows.
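As a rough illustration only, such a plugin could hook yum's post-transaction stage and report the packages it touched. The report URL and payload format below are made up; only the standard yum plugin hook interface is real.

# /usr/lib/yum-plugins/instantmirror-client.py (illustrative sketch)

import urllib
from yum.plugins import TYPE_CORE

requires_api_version = '2.3'
plugin_type = (TYPE_CORE,)

# Hypothetical InstantMirror reporting endpoint.
REPORT_URL = 'http://instantmirror.example.org:8000/report'

def posttrans_hook(conduit):
    # Names of the packages involved in this transaction.
    names = sorted(set(txmbr.name for txmbr in conduit.getTsInfo().getMembers()))
    if not names:
        return
    try:
        urllib.urlopen(REPORT_URL, urllib.urlencode({'packages': ','.join(names)}))
    except IOError:
        conduit.info(2, 'instantmirror-client: could not report package list')

InstantMirror could aggregate these reports to decide which packages are worth prefetching and which can be evicted early.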
Imagine a university with thousands of Linux users, everyone updating their system weekly. GBs of bandwidth are wasted every week on repeated downloads of the same packages.
If you have any suggestions for improvements, comments on the current design, or criticism of the design, please reply. Your feedback would really help me improve.
InstantMirror - https://fedorahosted.org/InstantMirror/wiki
InstantMirrorDaemon - https://fedorahosted.org/InstantMirror/wiki/InstantMirrorDaemon
InstantMirror needs a rethink - http://www.redhat.com/archives/rhl-devel-list/2008-January/msg02341.html
Warren Togami's Journal - http://wtogami.livejournal.com/20536.html
PS: This is my first RFC, so if I wrote it badly please forgive me :)
-------------------------------------------------------
Thank you,
Kulbir Saini,
Computer Science and Engineering,
International Institute of Information Technology,
Hyderabad, India - 500032.

My Home-Page: http://saini.co.in/
My Institute: http://www.iiit.ac.in/
My Linux-Blog: http://linux.saini.co.in/

IRC nick: generalBordeaux
Channels: #fedora, #fedora-devel, #yum on freenode
-------------------------------------------------------
On 10/03/2008, Kulbir Saini kulbirsaini@students.iiit.ac.in wrote:
1. InstantMirror to be used by a small group of people. In this case we have to get rid of dependencies like squid and apache, because for a small setup nobody is going to configure squid and apache. So for this kind of setup we use a proxy server implemented in Python and integrate InstantMirror with it in caching mode, so that it becomes easy to set up and doesn't require squid, apache or anything else.
You've just described "squid in offline mode", more or less, above (offline being the "honour request from cache if we're unable to connect" part).
2. InstantMirror to be used by an organization. As almost all organizations (I am focusing more on institutes/universities here) use a common proxy server to access the Internet, we will have an InstantMirror that can be integrated with squid. There will be no difficulty in setup because people already use squid (assuming squid is widely used in the Unix/Linux world) and know how to configure it. We can't use a proxy server implemented in Python here, because no organization would ever agree to use a stripped-down proxy server instead of squid.
... and in this case, you want squid. Basically.
Imagine a university with thousands of Linux users, everyone updating their system weekly. GBs of bandwidth are wasted every week on repeated downloads of the same packages.
You just have to update the "maximum object size" for the squid cache to cover the largest package in the distro. It's more or less plug and play after that. You will need to make all your clients use the same mirrorlist, of course :))
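For reference, the relevant squid.conf settings would be roughly these (the sizes are only illustrative):

# allow squid to cache objects as large as the biggest package you expect
maximum_object_size 2048 MB
# and give the cache enough disk to hold them (size in MB)
cache_dir ufs /var/spool/squid 51200 16 256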
If you have any suggestions for improvements, comments on the current design, or criticism of the design, please reply. Your feedback would really help me improve.
Seriously, you will end up rewriting squid. Just use it :o) It's surprisingly light on memory if you configure it right. If you wouldn't mind a single-purpose proxy running on your machine, you would get the same benefit from actually using squid, *plus* it can be used for other purposes (i.e. a shared cache for the browsers running on the machine, too). I've done this for many years now.
PS: This is my first RFC, so if I wrote it badly please forgive me :)
It's not bad at all! You've covered all the bases I could see. You would still need a "mirror manager" process involved (to do any kind of prefetching), even with squid as a proxy; so please don't take the above as any kind of slap in the face. You've done a good job of the requirements doc :)
Hi!
On 10/03/2008, Kulbir Saini kulbirsaini@students.iiit.ac.in wrote:
1. InstantMirror to be used by a small group of people. In this case we have to get rid of dependencies like squid and apache, because for a small setup nobody is going to configure squid and apache. So for this kind of setup we use a proxy server implemented in Python and integrate InstantMirror with it in caching mode, so that it becomes easy to set up and doesn't require squid, apache or anything else.
You've just described "squid in offline mode", more or less, above (offline being the "honour request from cache if we're unable to connect" part).
I agree that when we can't connect to the remote host we are not doing anything better than squid. Still, I think having something is better than nothing: if you can't fetch the package from upstream, serve whatever you have.
2. InstantMirror to be used by an organization. As almost all organizations (I am focusing more on institutes/universities here) use a common proxy server to access the Internet, we will have an InstantMirror that can be integrated with squid. There will be no difficulty in setup because people already use squid (assuming squid is widely used in the Unix/Linux world) and know how to configure it. We can't use a proxy server implemented in Python here, because no organization would ever agree to use a stripped-down proxy server instead of squid.
... and in this case, you want squid. Basically.
I disagree with this point. Based on my knowledge of squid (its actual behavior may be different from what I think or have understood), here we are doing more than what squid does.
1. Suppose I get a request for xyz-0.1.2.rpm from a mirror M1 of repo R1. If we are using squid, squid will fetch xyz-0.1.2.rpm from upstream, cache it and serve the client. But if another request comes in for xyz-0.1.2.rpm from a mirror M2 of repo R1, or from a mirror M3 of repo R2, squid will fetch xyz-0.1.2.rpm again, even though the packages are the same. (A small sketch of how such requests could be collapsed onto one cache entry follows point 2 below.)
2. Squid stores the cached packages on disk in a cryptic format that can't be browsed. You can't really prioritize what you want to store, as you could with the Yum plugin described above. In the long run we want to facilitate RPM search on the local mirror, so we should know what is stored where. Also, RPM packages may need to be stored on, or transferred to, a separate server, which is not possible with squid.
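To illustrate point 1, here is roughly what InstantMirror could do that squid does not: map mirror-specific URLs onto a mirror-independent cache key. The anchor strings and example URLs are assumptions about typical mirror layouts, nothing more.

def cache_key(url):
    path = url.split('://', 1)[-1]             # drop the scheme
    path = path.split('/', 1)[-1]              # drop the mirror hostname
    # Keep only the part of the path starting at a well-known repo anchor,
    # so different mirror directory prefixes collapse to the same key.
    for anchor in ('releases/', 'updates/', 'development/'):
        idx = path.find(anchor)
        if idx != -1:
            return path[idx:]
    return path

# e.g. both of these (hypothetical mirrors) map to
# "releases/9/Everything/i386/os/Packages/xyz-0.1.2.rpm":
#   http://mirror-a.example/fedora/releases/9/Everything/i386/os/Packages/xyz-0.1.2.rpm
#   http://mirror-b.example/pub/fedora/linux/releases/9/Everything/i386/os/Packages/xyz-0.1.2.rpm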
Imagine a university with thousands of Linux users, everyone updating their system weekly. GBs of bandwidth are wasted every week on repeated downloads of the same packages.
You just have to update the "maximum object size" for the squid cache to cover the largest package in the distro. It's more or less plug and play after that. You will need to make all your clients use the same mirrorlist, of course :))
We can't go to thousands of people and ask them to use the same mirrorlist. With InstantMirror, everybody is free to use whatever repo or mirrorlist they want, and our system will still cache the relevant packages perfectly fine.
If you have any suggestions for improvements, comments on the current design, or criticism of the design, please reply. Your feedback would really help me improve.
Seriously, you will end up rewriting squid. Just use it :o) It's surprisingly light on memory if you configure it right. If you wouldn't mind a single-purpose proxy running on your machine, you would get the same benefit from actually using squid, *plus* it can be used for other purposes (i.e. a shared cache for the browsers running on the machine, too). I've done this for many years now.
I don't think we are rewriting squid. It would be a very small Python module which does very limited but relevant things: it just intercepts requests for RPMs and metadata and lets squid handle everything else. We are basically not interfering with squid much. Squid simply can't meet all our requirements, and that's the reason we are forced to do something like this.
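To give an idea of how small that interception layer could be, here is a sketch of a helper that squid could run via its url_rewrite_program directive. The InstantMirror endpoint on localhost:8000 is hypothetical; only the one-reply-line-per-request redirector protocol is squid's.

#!/usr/bin/env python
# Illustrative squid redirector: send RPM and repodata requests to a
# (hypothetical) local InstantMirror listener, pass everything else through.

import sys

INSTANTMIRROR = 'http://localhost:8000/fetch?url='

def rewrite(url):
    if url.endswith('.rpm') or '/repodata/' in url:
        return INSTANTMIRROR + url
    return url                      # not our business, leave it untouched

for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    # squid sends "URL client_ip/fqdn user method ..."; reply one URL per line.
    sys.stdout.write(rewrite(fields[0]) + '\n')
    sys.stdout.flush()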
PS: This is my first RFC, so if I wrote it badly please forgive me :)
It's not bad at all! You've covered all the bases I could see. You would still need a "mirror manager" process involved (to do any kind of prefetching), even with squid as a proxy; so please don't take the above as any kind of slap in the face. You've done a good job of the requirements doc :)
Thanks a lot for noting down all those points. They helped me get a much clearer idea of InstantMirror. And thanks for the encouragement as well :)
-------------------------------------------------------
Thank you,
Kulbir Saini,
Computer Science and Engineering,
International Institute of Information Technology,
Hyderabad, India - 500032.

My Home-Page: http://saini.co.in/
My Institute: http://www.iiit.ac.in/
My Linux-Blog: http://linux.saini.co.in/

IRC nick: generalBordeaux
Channels: #fedora, #fedora-devel, #yum on freenode
-------------------------------------------------------
A long, long time ago, Kulbir Saini wrote:
Thanks a lot for noting down all those points. They helped me get a much clearer idea of InstantMirror. And thanks for the encouragement as well :)
Hi, wondering what is considered _the_ current approach for making an internal Fedora proxy mirror? Does the MirrorManager 0.4 code actually work, e.g. for 5 PCs?
Did something else (other than full rsync mirroring) emerge to solve this type of problem?
I was wondering if an automated way to get the clients to use the proxy/cache would be to implement DNS entries for the real yum server names that point to your internal ~mirror server, and hence bypass the need to set up a proxy on each individual machine?
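For example, with dnsmasq that could be a single override per mirror hostname (the hostname and address below are purely illustrative):

# dnsmasq.conf: answer queries for a mirror hostname with the internal cache box
address=/download.fedora.redhat.com/192.168.1.10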
Regards, DaveT.
On Tue, 2008-07-08 at 23:08 +1000, David Timms wrote:
Hi, wondering what is considered _the_ current approach for making an internal Fedora proxy mirror? Does the MirrorManager 0.4 code actually work, e.g. for 5 PCs?
Did something else (other than full rsync mirroring) emerge to solve this type of problem?
I was wondering if an automated way to get the clients to use the proxy/cache would be to implement DNS entries for the real yum server names that point to your internal ~mirror server, and hence bypass the need to set up a proxy on each individual machine?
Regards, DaveT.
Even with all its weaknesses, InstantMirror could be used with MirrorManager to do transparent caching/mirroring... Take a system with enough HDD space, install InstantMirror, go to the MirrorManager admin page, create a new site, add your IPs and the mirror. Then get the mirror reporting script and make it run with cron :) It isn't perfect, but it's much better than nothing.
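For the cron part, something like this would do, assuming the MirrorManager reporting script is installed as report_mirror:

# /etc/cron.d/report_mirror (illustrative): report to MirrorManager every 6 hours
0 */6 * * * root /usr/bin/report_mirror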
Suren Karapetyan wrote:
Even with all its weaknesses, InstantMirror could be used with MirrorManager to do transparent caching/mirroring... Take a system with enough HDD space, install InstantMirror, go to the MirrorManager admin page, create a new site, add your IPs and the mirror. Then get the mirror reporting script and make it run with cron :) It isn't perfect, but it's much better than nothing.
I personally use squid in reverse proxy mode instead of InstantMirror. The main drawback of squid is the cache cannot be shared for other protocols (like rsync), but it is otherwise better because it handles cleanup and respects whatever maximum amount of storage you set. InstantMirror will keep growing and growing until it exhausts all space. InstantMirror also poorly handles concurrent clients.
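In case it helps, the reverse proxy setup is roughly the following in squid.conf (hostnames and sizes are only an example):

# listen as an accelerator for one upstream mirror
http_port 80 accel defaultsite=download.fedora.redhat.com
cache_peer download.fedora.redhat.com parent 80 0 no-query originserver name=mirror
acl mirror_site dstdomain download.fedora.redhat.com
cache_peer_access mirror allow mirror_site
http_access allow mirror_site
# let squid cache whole RPMs
maximum_object_size 2048 MB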
Warren
On Tue, Jul 8, 2008 at 10:14 AM, Warren Togami wtogami@redhat.com wrote:
I personally use squid in reverse proxy mode instead of InstantMirror. The main drawback of squid is the cache cannot be shared for other protocols (like rsync), but it is otherwise better because it handles cleanup and respects whatever maximum amount of storage you set. InstantMirror will keep growing and growing until it exhausts all space. InstantMirror also poorly handles concurrent clients.
I wanted to try InstantMirror, but was in a rush. I just used a basic squid setup.
Are there any advantages to using it as a reverse proxy vs. regular squid usage?
Arthur Pemberton
Arthur Pemberton wrote:
I wanted to try InstantMirror, but was in a rush. I just used a basic squid setup.
Are there any advantages to using it as a reverse proxy vs. regular squid usage?
Regular squid won't be able to cache effectively if MirrorManager is telling you to use random mirrors. If you use MirrorManager, yum clients on your network block could receive the same local (reverse proxy) mirror as the first mirror every time.
refresh_pattern repodata/.*$ 0 0% 0
refresh_pattern .*rpm$ 0 0% 0
Also, with any squid.conf you will need these lines in order to guarantee that your repodata and RPMs stay consistent with your upstream source. This is because proxies cannot cope with content that changes while the filename stays the same.
Warren Togami wtogami@redhat.com
On Tue, Jul 8, 2008 at 10:48 AM, Warren Togami wtogami@redhat.com wrote:
Regular squid won't be able to cache effectively if MirrorManager is telling you to use random mirrors. If you use MirrorManager, yum clients on your network block could receive the same local (reverse proxy) mirror as the first mirror every time.
refresh_pattern repodata/.*$ 0 0% 0
refresh_pattern .*rpm$ 0 0% 0
Also, with any squid.conf you will need these lines in order to guarantee that your repodata and RPMs stay consistent with your upstream source. This is because proxies cannot cope with content that changes while the filename stays the same.
Warren Togami wtogami@redhat.com
Ok thanks. What I did was comment out the mirrorlist URL and just use the base URL. I'll add those refresh patterns, but doesn't the second one effectively turn off caching of *.rpm?
Arthur Pemberton wrote:
Ok thanks. What I did was comment out the mirrorlist URL and just use the base URL. I'll add those refresh patterns, but doesn't the second one effectively turn off caching of *.rpm?
Not exactly. It checks with the source server on every request whether the data has changed, but it doesn't re-download the entire thing. The only way we could avoid checking the upstream source is if all filenames on the mirrors changed every time their contents change. This is possible, and we considered it for repodata, but decided against it because it would have broken earlier clients. It is also currently not possible with the RPMs themselves. Hence the need for the refresh_pattern rules.
refresh_pattern images/.*$ 0 0% 0
I just realized that you probably want this additional rule to provide the same guarantees for stage2.img and other stuff in that directory.
Warren Togami wtogami@redhat.com
On Tue, Jul 8, 2008 at 11:12 AM, Warren Togami wtogami@redhat.com wrote:
Not exactly. It checks with the source server on every request whether the data has changed, but it doesn't re-download the entire thing. The only way we could avoid checking the upstream source is if all filenames on the mirrors changed every time their contents change. This is possible, and we considered it for repodata, but decided against it because it would have broken earlier clients. It is also currently not possible with the RPMs themselves. Hence the need for the refresh_pattern rules.
refresh_pattern images/.*$ 0 0% 0
I just realized that you probably want this additional rule to provide the same guarantees for stage2.img and other stuff in that directory.
Warren Togami wtogami@redhat.com
Okay, thanks. I had apparently not completely understood how refresh patterns were used; I am clearer now.
On Tuesday, 8 July 2008 at 12:12 -0400, Warren Togami wrote:
The only way we could avoid checking the upstream source is if all filenames on the mirrors changed every time their contents change. This is possible, and we considered it for repodata, but decided against it because it would have broken earlier clients.
So repodata is condemned to be broken with proxies just because it was designed broken? Please reconsider. There are many places in the world where you can only access the network through proxies (for good reasons), and yum cannot really be used there right now.
Just autogenerate two sets of metadata and deprecate the proxy-unfriendly version after a few years.
On Tue, 2008-07-08 at 19:36 +0200, Nicolas Mailhot wrote:
So repodata is condemned to be broken with proxies just because it was designed broken? Please reconsider. There are many places in the world where you can only access the network through proxies (for good reasons), and yum cannot really be used there right now.
Just autogenerate two sets of metadata and deprecate the proxy-unfriendly version after a few years.
Repodata files will very shortly have unique names rather than static names. repomd.xml will remain unchanged, though.