The OOM killer on bapp02 has terminated a few mirrormanager crawler processes. It seems it needs more memory or the number of parallel crawlers has to be further limited.
Another good idea would be to limit the duration of the rsync crawls in /etc/mirrormanager/prod.cfg to maybe one day (--timeout=86400) to avoid stale rsync processes.
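For illustration, a minimal sketch (with hypothetical helper names, not MirrorManager's actual code) of how a crawl's rsync invocation could be bounded. One caveat: rsync's --timeout=N aborts after N seconds of I/O *inactivity*, not total runtime, so a wall-clock cap has to be enforced by the caller as well:

```python
import subprocess

CRAWL_TIMEOUT = 86400  # one day, as suggested above


def rsync_listing_cmd(url, timeout=CRAWL_TIMEOUT):
    # rsync's --timeout aborts after `timeout` seconds of I/O inactivity
    return ["rsync", "--no-motd", "--recursive", "--list-only",
            "--timeout=%d" % timeout, url]


def run_crawl(url):
    try:
        # the timeout= argument additionally caps total wall-clock duration
        return subprocess.check_output(rsync_listing_cmd(url),
                                       timeout=CRAWL_TIMEOUT)
    except subprocess.TimeoutExpired:
        return None  # crawl exceeded one day; treat the host as stale
```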
Adrian
On Tue, 25 Nov 2014 10:27:19 +0100 Adrian Reber adrian@lisas.de wrote:
> The OOM killer on bapp02 has terminated a few mirrormanager crawler processes. It seems it needs more memory or the number of parallel crawlers has to be further limited.
Well, it's got 16GB now... I can bump it to 24 without too much trouble. Will of course need a freeze break...
We are currently doing 60 threads. We could cut it down, but I guess I'd say let's try more memory first.
> Another good idea would be to limit the duration of the rsync crawls in /etc/mirrormanager/prod.cfg to maybe one day (--timeout=86400) to avoid stale rsync processes.
Good idea. What's the config directive there? It seems we are not setting it at all currently.
kevin
On Tue, Nov 25, 2014 at 07:24:32AM -0700, Kevin Fenzi wrote:
> > The OOM killer on bapp02 has terminated a few mirrormanager crawler processes. It seems it needs more memory or the number of parallel crawlers has to be further limited.
> Well, it's got 16GB now... I can bump it to 24 without too much trouble. Will of course need a freeze break...
> We are currently doing 60 threads. We could cut it down, but I guess I'd say let's try more memory first.
> > Another good idea would be to limit the duration of the rsync crawls in /etc/mirrormanager/prod.cfg to maybe one day (--timeout=86400) to avoid stale rsync processes.
> Good idea. What's the config directive there? It seems we are not setting it at all currently.
I just found out that the crawler is not configurable. I was thinking about update-master-directory-list.
To add a timeout to the rsync crawls it would be necessary to add it directly to /usr/share/mirrormanager/server/crawler_perhost:497
So maybe something for the mirrormanager rewrite.
Adrian
On Tue, Nov 25, 2014 at 07:24:32AM -0700, Kevin Fenzi wrote:
> On Tue, 25 Nov 2014 10:27:19 +0100 Adrian Reber adrian@lisas.de wrote:
> > The OOM killer on bapp02 has terminated a few mirrormanager crawler processes. It seems it needs more memory or the number of parallel crawlers has to be further limited.
> Well, it's got 16GB now... I can bump it to 24 without too much trouble. Will of course need a freeze break...
> We are currently doing 60 threads. We could cut it down, but I guess I'd say let's try more memory first.
As dmesg on bapp02 has no timestamps, it is hard to tell when the last crawler was terminated because of OOM. Looking at different log files it seems, however, that some crawler processes are terminated without finishing correctly. Mirrors which take a long time to crawl, in particular, are not examined completely. So it might be a good idea to decrease the number of parallel crawls to avoid OOM situations.
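Capping the in-flight crawls could be as simple as a bounded semaphore around the per-host work; a sketch with assumed names (not MirrorManager's real code) and an arbitrary example limit of 30 instead of the current 60:

```python
import threading

MAX_PARALLEL_CRAWLS = 30  # example value, down from the current 60

crawl_slots = threading.BoundedSemaphore(MAX_PARALLEL_CRAWLS)


def crawl_host(host, results):
    # Blocks while MAX_PARALLEL_CRAWLS crawls are already in flight,
    # keeping peak memory use roughly proportional to the limit.
    with crawl_slots:
        results.append(host)  # placeholder for the real per-host crawl
```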
Adrian
On Tue, 2014-12-09 at 09:42 +0100, Adrian Reber wrote:
> On Tue, Nov 25, 2014 at 07:24:32AM -0700, Kevin Fenzi wrote:
> > On Tue, 25 Nov 2014 10:27:19 +0100 Adrian Reber adrian@lisas.de wrote:
> > > The OOM killer on bapp02 has terminated a few mirrormanager crawler processes. It seems it needs more memory or the number of parallel crawlers has to be further limited.
> > Well, it's got 16GB now... I can bump it to 24 without too much trouble. Will of course need a freeze break...
> > We are currently doing 60 threads. We could cut it down, but I guess I'd say let's try more memory first.
> As dmesg on bapp02 has no timestamps, it is hard to tell when the last crawler was terminated because of OOM. Looking at different log files it seems, however, that some crawler processes are terminated without finishing correctly. Mirrors which take a long time to crawl, in particular, are not examined completely. So it might be a good idea to decrease the number of parallel crawls to avoid OOM situations.
I have just managed to reproduce it right now while I was running the umdl script, maybe it didn't like having both running at the same time?
Pierre
infrastructure@lists.fedoraproject.org