The biggest problem with MM2 right now is that the crawler segfaults when running with more than 10 or 12 threads. The current configuration runs daily with 75 threads and crashes regularly:
[740512.481002] mm2_crawler[18149]: segfault at 30 ip 00007ffdd8201557 sp 00007ffd787d5250 error 4 in libcurl.so.4.3.0[7ffdd81d8000+63000]
[783445.620762] mm2_crawler[20500]: segfault at 30 ip 00007f87477ff557 sp 00007f86e7fd4250 error 4 in libcurl.so.4.3.0[7f87477d6000+63000]
[826619.130431] mm2_crawler[24376]: segfault at 30 ip 00007f7cee7ac557 sp 00007f7c8cfde250 error 4 in libcurl.so.4.3.0[7f7cee783000+63000]
[869846.873962] mm2_crawler[27771]: segfault at 30 ip 00007ffd3bc07557 sp 00007ffd11ff8250 error 4 in libcurl.so.4.3.0[7ffd3bbde000+63000]
Preloading libcurl from F21 on the command line seems to make the segfault go away. So somewhere between curl 7.29 (RHEL 7.1) and curl 7.37 (F21) something was fixed that RHEL 7.1 would need before switching to MM2.
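As a sketch of the workaround (illustrative only; the path to the F21 build of libcurl is an assumption, and in practice this was simply an LD_PRELOAD prefix on the command line):

    # preload_f21_curl.py -- illustrative wrapper, not part of MM2.
    # Re-exec mm2_crawler with the F21 libcurl preloaded; the library
    # location below is an assumed path, not the real one.
    import os
    import sys

    env = dict(os.environ)
    env["LD_PRELOAD"] = "/opt/curl-f21/lib/libcurl.so.4.3.0"  # assumed path
    os.execvpe("mm2_crawler", ["mm2_crawler"] + sys.argv[1:], env)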
Additionally, the 4GB of RAM on mm-crawler01 are not enough to crawl all the mirrors in a reasonable time. Even when started with only 20 crawler threads instead of 75, the 4GB are not enough.
So a fix (or workaround) is needed for the libcurl problem, along with much more memory. Then the configuration needs to be changed so that the crawler and umdl (and the other cronjobs) do not run as root. Most of the user changes should happen with a new release and new RPM of mirrormanager2.
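One possible shape for that (the "mirrormanager" service account, schedule, command paths and options here are all assumptions, not the actual configuration): /etc/cron.d entries carry a per-job user field, so the jobs would no longer need to live in root's crontab.

    # /etc/cron.d/mirrormanager -- sketch only; user, schedule, command
    # paths and options are assumptions
    0 2 * * * mirrormanager /usr/bin/mm2_crawler --threads 50
    0 */6 * * * mirrormanager /usr/bin/mm2_update-master-directory-list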
Adrian
On Fri, Mar 20, 2015 at 04:38:24PM +0100, Adrian Reber wrote:
> Preloading libcurl from F21 on the command line seems to make the segfault go away. So somewhere between curl 7.29 (RHEL 7.1) and curl 7.37 (F21) something was fixed that RHEL 7.1 would need before switching to MM2.
This is discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1204825
> Additionally, the 4GB of RAM on mm-crawler01 are not enough to crawl all the mirrors in a reasonable time.
This has been increased to 32GB (thanks) and I had a few test runs of the crawler over the weekend with libcurl from F21:
All runs for 435 mirrors take at least 6 hours:
50 threads: http://lisas.de/~adrian/crawler-resources/2015-03-21-19-51-44-crawler-resour...
50 threads with explicit garbage collection: http://lisas.de/~adrian/crawler-resources/2015-03-22-06-18-30-crawler-resour...
75 threads: http://lisas.de/~adrian/crawler-resources/2015-03-22-13-02-37-crawler-resour...
75 threads with explicitly setting variables to None at the end: http://lisas.de/~adrian/crawler-resources/2015-03-23-07-46-19-crawler-resour...
Manually triggering the garbage collector makes almost no difference (if any at all). The crawler takes a huge amount of memory and a really long time.
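For context, a minimal Python sketch of what those two experiments look like (illustrative only, not the actual crawler code):

    # Sketch of the two memory experiments from the runs above.
    import gc

    def crawl_host(hostname):
        # stand-in for the real work: the crawler builds up large file
        # lists per mirror host while checking it
        file_list = ["pub/fedora/linux/releases/%d" % i for i in range(100000)]
        file_list = None   # experiment: set variables to None at the end
        gc.collect()       # experiment: trigger garbage collection explicitly

    crawl_host("mirror.example.org")

In CPython gc.collect() only reclaims reference cycles; objects whose last reference is dropped are freed immediately by reference counting anyway, and memory still referenced by dozens of running threads cannot be given back by either experiment. That matches the measurements above.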
As much as I like the new threaded design, I am not 100% convinced it is the best solution when looking at the memory requirements. Somewhere memory must be leaking.
The next change I will make is to sort the mirrors descending by crawl duration, to make sure the longest-running crawls are started as early as possible (this was implemented in MM1). I will then try to start with 100 threads to see how long it takes and how much memory is required.
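A minimal sketch of that ordering (the per-host duration field is an assumption about the schema):

    # Sketch: start the slowest crawls first so they fit inside the run
    # window. "last_crawl_duration" is an assumed field name.
    hosts = [
        {"name": "mirror-a.example.org", "last_crawl_duration": 3600},
        {"name": "mirror-b.example.org", "last_crawl_duration": 21000},
        {"name": "mirror-c.example.org", "last_crawl_duration": 900},
    ]
    for host in sorted(hosts, key=lambda h: h["last_crawl_duration"], reverse=True):
        print("queueing", host["name"])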
Adrian
On 23 March 2015 at 09:59, Adrian Reber adrian@lisas.de wrote:
> Manually triggering the garbage collector makes almost no difference (if any at all). The crawler takes a huge amount of memory and a really long time.
> [...]
> I will then try to start with 100 threads to see how long it takes and how much memory is required.
I would think that increasing threads would get bogged down by either network access or CPUs. Since we aren't seeing more than 130% CPU usage, I am guessing it is bogged down by network access (e.g. it can only poll so many networks per second per interface, and they can only return so quickly on that one interface). Do you think that having 2 or more crawler systems might do better?
On Mon, Mar 23, 2015 at 10:11:56AM -0600, Stephen John Smoogen wrote:
On 23 March 2015 at 09:59, Adrian Reber adrian@lisas.de wrote:
>> I will then try to start with 100 threads to see how long it takes and how much memory is required.
100 threads is too much with 32GB. This OOM'd and was killed.
> Do you think that having 2 or more crawler systems might do better?
I was hoping to eventually run 2 more crawlers. With a simple setup it is possible to distribute the crawling to more machines: we know how many mirror hosts we have, and the crawler can be given a host start and stop ID. This distribution will not be perfect, as it does not take into account that mirrors might be inactive/disabled/private, but as a simple way to distribute the load it should be good enough.
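A minimal sketch of that partitioning (the --startid/--stopid option names are assumptions based on the description above):

    # Sketch: split the known host-ID space evenly across N crawler
    # machines. Inactive/disabled/private mirrors inside a range are
    # simply skipped, so the split is not perfectly balanced.
    def partition(total_hosts, machines):
        size = -(-total_hosts // machines)  # ceiling division
        return [(i * size + 1, min((i + 1) * size, total_hosts))
                for i in range(machines)]

    # e.g. the 435 crawled hosts spread over three machines
    for start, stop in partition(435, 3):
        print("mm2_crawler --startid %d --stopid %d" % (start, stop))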
Adrian