plague: Job waited too long for repo to unlock. Killing it...

List overview All Threads
Download

newer

older

Please add to the wishlist

Revisor built in EL-5

Michael Schwendt

30 Dec 2007 30 Dec '07

10:54 a.m.

If in a failed job.log you see the message

Job waited too long for repo to unlock. Killing it...

please notify me.

It's a problem in the plague server code that results in a denial of service for subsequent build jobs. I have a traceback from Dec 28th, but in the context of the source code it doesn't make sense yet (because a few lines earlier the code ensures that the files to be copied exist and are readable). Buildsys runs a slightly modified version that adds a bit more debug output in this area.

[...]

37683 (grads/ppc): Build result files - [ 'state.log', 'grads-1.9b4-21.el5.src.rpm', 'grads-debuginfo-1.9b4-21.el5.ppc.rpm', 'grads-1.9b4-21.el5.ppc.rpm', 'root.log', 'build.log', 'job.log' ] 37683 (grads/x86_64): Build result files - [ 'root.log', 'build.log', 'grads-debuginfo-1.9b4-21.el5.x86_64.rpm', 'grads-1.9b4-21.el5.x86_64.rpm', 'grads-1.9b4-21.el5.src.rpm', 'state.log', 'job.log' ] 37683 (grads/i386): Build result files - [ 'build.log', 'state.log', 'grads-1.9b4-21.el5.i386.rpm', 'job.log', 'root.log', 'grads-debuginfo-1.9b4-21.el5.i386.rpm', 'grads-1.9b4-21.el5.src.rpm' ] Repo 'fedora-5-epel': updating repository metadata... Exception in thread Repo: fedora-5-epel: Traceback (most recent call last): File "/usr/lib64/python2.3/threading.py", line 436, in __bootstrap self.run() File "/usr/share/plague/server/Repo.py", line 233, in run self._update_repo() File "/usr/share/plague/server/Repo.py", line 137, in _update_repo shutil.copy(src, file_in_dst) File "/usr/lib64/python2.3/shutil.py", line 72, in copy copymode(src, dst) File "/usr/lib64/python2.3/shutil.py", line 49, in copymode st = os.stat(src) OSError: [Errno 2] No such file or directory: '/srv/rpmbuild/server_work/fedora-5-epel/37683-grads-1.9b4-21.el5/x86_64/grads-1.9b4-21.el5.x86_64.rpm' 37683 (grads): Job finished.

Show replies by date

Patrice Dumas

30 Dec 30 Dec

12:47 p.m.

On Sun, Dec 30, 2007 at 05:54:51PM +0100, Michael Schwendt wrote:

...

37683 (grads/ppc): Build result files - [ 'state.log', 'grads-1.9b4-21.el5.src.rpm', 'grads-debuginfo-1.9b4-21.el5.ppc.rpm', 'grads-1.9b4-21.el5.ppc.rpm', 'root.log', 'build.log', 'job.log' ] 37683 (grads/x86_64): Build result files - [ 'root.log', 'build.log', 'grads-debuginfo-1.9b4-21.el5.x86_64.rpm', 'grads-1.9b4-21.el5.x86_64.rpm', 'grads-1.9b4-21.el5.src.rpm', 'state.log', 'job.log' ] 37683 (grads/i386): Build result files - [ 'build.log', 'state.log', 'grads-1.9b4-21.el5.i386.rpm', 'job.log', 'root.log', 'grads-debuginfo-1.9b4-21.el5.i386.rpm', 'grads-1.9b4-21.el5.src.rpm' ]

I can say that it is one of my packages, but I don't know what could be wrong, the build logs don't look bad...

-- Pat

Michael Schwendt

4:29 p.m.

On Sun, 30 Dec 2007 19:47:49 +0100, Patrice Dumas wrote:

...

On Sun, Dec 30, 2007 at 05:54:51PM +0100, Michael Schwendt wrote:

...
37683 (grads/ppc): Build result files - [ 'state.log', 'grads-1.9b4-21.el5.src.rpm', 'grads-debuginfo-1.9b4-21.el5.ppc.rpm', 'grads-1.9b4-21.el5.ppc.rpm', 'root.log', 'build.log', 'job.log' ] 37683 (grads/x86_64): Build result files - [ 'root.log', 'build.log', 'grads-debuginfo-1.9b4-21.el5.x86_64.rpm', 'grads-1.9b4-21.el5.x86_64.rpm', 'grads-1.9b4-21.el5.src.rpm', 'state.log', 'job.log' ] 37683 (grads/i386): Build result files - [ 'build.log', 'state.log', 'grads-1.9b4-21.el5.i386.rpm', 'job.log', 'root.log', 'grads-debuginfo-1.9b4-21.el5.i386.rpm', 'grads-1.9b4-21.el5.src.rpm' ]

I can say that it is one of my packages, but I don't know what could be wrong, the build logs don't look bad...

The build is successful. The build master copies the downloaded rpms into the needsign repo. The traceback is in Python's shutil.copy() when copying the mode of a successfully copied rpm:

[...] copyfile(src, dst) # succeeds copymode(src, dst) # src not found! huh? [...]

That's weird, since the rpms [to be copied, the "src" arg] are not deleted until all have been copied. The traceback in turn terminates the thread which controls the needsign repo, and the repo stays locked. With that, all subsequent build jobs cannot enter even the prep stage.

Dan Williams

31 Dec 31 Dec

10 a.m.

On Sun, 2007-12-30 at 17:54 +0100, Michael Schwendt wrote:

...

If in a failed job.log you see the message
Job waited too long for repo to unlock. Killing it...
please notify me.

It's a problem in the plague server code that results in a denial of service for subsequent build jobs. I have a traceback from Dec 28th, but in the context of the source code it doesn't make sense yet (because a few lines earlier the code ensures that the files to be copied exist and are readable). Buildsys runs a slightly modified version that adds a bit more debug output in this area.

Maybe just trap the exception, print it out, and continue? That way at least the server doesn't fall over, it just fails to copy one item. It might also help debugging to see if only specific files can't be copied...

Dan

...

[...]

37683 (grads/ppc): Build result files - [ 'state.log', 'grads-1.9b4-21.el5.src.rpm', 'grads-debuginfo-1.9b4-21.el5.ppc.rpm', 'grads-1.9b4-21.el5.ppc.rpm', 'root.log', 'build.log', 'job.log' ] 37683 (grads/x86_64): Build result files - [ 'root.log', 'build.log', 'grads-debuginfo-1.9b4-21.el5.x86_64.rpm', 'grads-1.9b4-21.el5.x86_64.rpm', 'grads-1.9b4-21.el5.src.rpm', 'state.log', 'job.log' ] 37683 (grads/i386): Build result files - [ 'build.log', 'state.log', 'grads-1.9b4-21.el5.i386.rpm', 'job.log', 'root.log', 'grads-debuginfo-1.9b4-21.el5.i386.rpm', 'grads-1.9b4-21.el5.src.rpm' ] Repo 'fedora-5-epel': updating repository metadata... Exception in thread Repo: fedora-5-epel: Traceback (most recent call last): File "/usr/lib64/python2.3/threading.py", line 436, in __bootstrap self.run() File "/usr/share/plague/server/Repo.py", line 233, in run self._update_repo() File "/usr/share/plague/server/Repo.py", line 137, in _update_repo shutil.copy(src, file_in_dst) File "/usr/lib64/python2.3/shutil.py", line 72, in copy copymode(src, dst) File "/usr/lib64/python2.3/shutil.py", line 49, in copymode st = os.stat(src) OSError: [Errno 2] No such file or directory: '/srv/rpmbuild/server_work/fedora-5-epel/37683-grads-1.9b4-21.el5/x86_64/grads-1.9b4-21.el5.x86_64.rpm' 37683 (grads): Job finished.

-- Fedora-buildsys-list mailing list Fedora-buildsys-list@redhat.com https://www.redhat.com/mailman/listinfo/fedora-buildsys-list

Michael Schwendt

11:04 a.m.

On Mon, 31 Dec 2007 11:00:12 -0500, Dan Williams wrote:

...

On Sun, 2007-12-30 at 17:54 +0100, Michael Schwendt wrote:

...
If in a failed job.log you see the message
Job waited too long for repo to unlock. Killing it...
please notify me.

It's a problem in the plague server code that results in a denial of service for subsequent build jobs. I have a traceback from Dec 28th, but in the context of the source code it doesn't make sense yet (because a few lines earlier the code ensures that the files to be copied exist and are readable). Buildsys runs a slightly modified version that adds a bit more debug output in this area.
Maybe just trap the exception, print it out, and continue? That way at least the server doesn't fall over, it just fails to copy one item.

The buildsys runs such a patched Repo.py already. It catches OSError, IOError, unlocks the locks and prints/logs the results of the file access check prior to when files are copied.

I also added a debug line in the package job code to see when it starts deleting the copied files. Normally it waits until a callback tells it that all files are copied.

...

It might also help debugging to see if only specific files can't be copied...

The offending file was copied, but shutil.copy() failed in its second part when trying to copy the file mode. It didn't find the source file it had just copied. :-}

Michael Schwendt

3 Jan 3 Jan

3:56 p.m.

On Sun, 30 Dec 2007 17:54:51 +0100, Michael Schwendt wrote:

...

If in a failed job.log you see the message
Job waited too long for repo to unlock. Killing it...
please notify me.

It's a problem in the plague server code that results in a denial of service for subsequent build jobs. I have a traceback from Dec 28th, but in the context of the source code it doesn't make sense yet (because a few lines earlier the code ensures that the files to be copied exist and are readable). Buildsys runs a slightly modified version that adds a bit more debug output in this area.

Certainly related, job #37767 just failed

Failed to copy /srv/rpmbuild/server_work/fedora-5-epel/37767-php-pecl-memcache-2.2.1-1.el5/ppc/php-pecl-memcache-2.2.1-1.el5.ppc.rpm to the repository directory.

and that should not happen either. I assume this is not the first time this has happened. The additional debug details are interesting.

In the server log, the individual arch jobs print the list of files downloaded from the builders. At that time they should be available already:

37767 (php-pecl-memcache/ppc): Build result files - [ 'php-pecl-memcache-2.2.1-1.el5.ppc.rpm', 'php-pecl-memcache-2.2.1-1.el5.src.rpm', 'root.log', 'php-pecl-memcache-debuginfo-2.2.1-1.el5.ppc.rpm', 'state.log', 'job.log', 'build.log' ]

37767 (php-pecl-memcache/x86_64): Build result files - [ 'root.log', 'build.log', 'state.log', 'php-pecl-memcache-2.2.1-1.el5.src.rpm', 'php-pecl-memcache-debuginfo-2.2.1-1.el5.x86_64.rpm', 'php-pecl-memcache-2.2.1-1.el5.x86_64.rpm', 'job.log' ]

37767 (php-pecl-memcache/i386): Build result files - [ 'root.log', 'php-pecl-memcache-debuginfo-2.2.1-1.el5.i386.rpm', 'state.log', 'php-pecl-memcache-2.2.1-1.el5.i386.rpm', 'build.log', 'php-pecl-memcache-2.2.1-1.el5.src.rpm', 'job.log' ]

Further down in the log, the fedora-5-epel repo controller is asked to install the files into the needsign repo:

Repo 'fedora-5-epel': updating repository metadata...

At that time, the last of the ppc files is not yet accessible (i.e. it either doesn't exist or cannot be read):

[...] Repo: /srv/rpmbuild/server_work/fedora-5-epel/37767-php-pecl-memcache-2.2.1-1.el5/x86_64/php-pecl-memcache-2.2.1-1.el5.x86_64.rpm is accessible. Repo: /srv/rpmbuild/server_work/fedora-5-epel/37767-php-pecl-memcache-2.2.1-1.el5/ppc/php-pecl-memcache-2.2.1-1.el5.ppc.rpm is inaccessible.

Further proof-reading of source code necessary... it should not populate needsign before all files are downloaded.

Michael Schwendt

4 Jan 4 Jan

8:55 a.m.

...

Certainly related, job #37767 just failed

Failed to copy /srv/rpmbuild/server_work/fedora-5-epel/37767-php-pecl-memcache-2.2.1-1.el5/ppc/php-pecl-memcache-2.2.1-1.el5.ppc.rpm to the repository directory.

Status update:

A work-around and debug-helper has been applied. I've moved the "does the downloaded file exist and is readable?" check _in front_ of the downloader callback. It now checks the downloaded file as early as possible, and if the check fails, the download is marked as failed, so plague retries downloading it at most 10 times. If that fails, there is a serious problem of course (and that's not what we've seen). And if the good download turns inaccessible later although it had passed the check, that would be *very* interesting and will appear in the logs. Perhaps this mysterious download/add_to_repo problem is only a rare bug in threaded Python and working around it might be the only thing we can do.

5963

Age (days ago)

5968

Last active (days ago)

epel-devel@lists.fedoraproject.org

6 comments

3 participants

tags (0)

participants (3)

Dan Williams
Michael Schwendt
Patrice Dumas