Yesterday my server had a power failure and the LDAP database got corrupted. The weird thing is that the corruption caused the ldap init script to hang, which stopped the server from booting. Today I corrupted the database on purpose to reproduce the problem, but it no longer happens. Besides all the precautions (UPS, RAID, journaled filesystems, etc.), and besides the fact that it was probably a very particular kind of database corruption, shouldn't the init scripts have a timeout? If any of the scripts hangs, the whole boot process stops. Isn't this really dangerous behaviour?
Hello, Federico wrote:
Besides all the precautions (UPS, RAID, journaled filesystems, etc.), and besides the fact that it was probably a very particular kind of database corruption, shouldn't the init scripts have a timeout? If any of the scripts hangs, the whole boot process stops. Isn't this really dangerous behaviour?
Ignoring initscript timeouts could be even more dangerous, e.g. when starting iptables or auditd.
Mirek
On 4/15/06, Miloslav Trmac mitr@volny.cz wrote:
Ignoring initscript timeouts could be even more dangerous, e.g. when starting iptables or auditd.
Ideally, services depending on network, even things like web servers, should not fail when there is no connection and instead wait for one. With laptops, having to wait while it figures out you're not connected to any networks is very time-consuming and frustrating. Network startup shouldn't delay the boot process, and services depending on it should wait until it's started (though possibly connect to any available networks like loopback in the meantime). Of course, some of this is not possible with the current init system, so a new system that allows for this kind of thing would be a good thing. If a system is dependent on a network to even be usable (eg when an important partition is network mounted), it should wait, but for most people waiting for networks to connect is not needed.
n0dalus.
On Sat, 2006-04-15 at 22:01 +0930, n0dalus wrote:
Ideally, services depending on network, even things like web servers, should not fail when there is no connection and instead wait for one.
Indeed. Now we have this static order for services initialization which sucks. The new schemes will (hopefully) introduce dynamic dependencies. But I feel that this is still too inflexible -- we need a well defined API to be able to programmatically wait for a service to start/stop/etc
For example, say we have service A (networking), and service B (httpd) that depends on service A. Let's also assume that B needs A only towards the end of its startup. With a 'services' API, the startup sequence for B could be written as:

    // do time-consuming part that does not depend on A
    service_wait(A)
    // do part that depends on A
Then we can start A & B in parallel and things will (A) sort themselves automagically, and (B) achieve the fastest possible startup time, especially on the new multicore boxes coming our way.
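The service_wait idea above can be sketched in Python, treating each service as a thread and each "has started" condition as a threading.Event. This is only an illustration under those assumptions: ServiceRegistry, mark_started, start_network and start_httpd are hypothetical names, and the time.sleep calls stand in for real startup work.

```python
import threading
import time


class ServiceRegistry:
    """Tracks which services have finished starting (hypothetical API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ready = {}

    def _event(self, name):
        # Create the event on first use so waiters may arrive before the service.
        with self._lock:
            return self._ready.setdefault(name, threading.Event())

    def mark_started(self, name):
        self._event(name).set()

    def service_wait(self, name, timeout=None):
        """Block until `name` has started; return False if the timeout expires."""
        return self._event(name).wait(timeout)


def start_network(registry, log):
    time.sleep(0.2)  # stand-in for bringing interfaces up
    log.append("A: network up")
    registry.mark_started("A")


def start_httpd(registry, log):
    time.sleep(0.1)  # time-consuming part that does not depend on A
    log.append("B: config parsed")
    if not registry.service_wait("A", timeout=5):
        log.append("B: gave up waiting for A")
        return
    log.append("B: listening")  # part that depends on A


def boot():
    """Start A and B in parallel and return the order in which things happened."""
    registry, log = ServiceRegistry(), []
    threads = [
        threading.Thread(target=start_httpd, args=(registry, log)),
        threading.Thread(target=start_network, args=(registry, log)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

B is started first here on purpose: the ordering is enforced by the wait, not by the launch order. The timeout argument also gives B a way to give up instead of hanging the boot.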
Why should httpd depend on networking? What if the network comes up later? Do you not want httpd to already be running?
I'd vote for a trap for init scripts, where (like the BSDs) you could Ctrl-C out of a script if it's taking too long.
On Sun, 2006-04-16 at 23:22 -0400, Harry Hoffman wrote:
Why should httpd depend on networking? What if the network comes up later? Do you not want httpd to already be running?
You're reading too much into the exact example there :) Just use A and B.
But I admit, I was a bit off-topic regarding the timeout problem. I was commenting on the need to have greater control of these inter service dependencies.
As for the timeout, that would have to be handled by the services manager that starts the services up. Ctrl-C is fine if you're in front of the console, but the reboot may be unattended, so we need a system that is able to handle things by itself as well.
--- Dimi Paun dimi@lattica.com wrote:
As for the timeout, that would have to be handled by the services manager that starts the services up. Ctrl-C is fine if you're in front of the console, but the reboot may be unattended, we need to have a system that is able to handle things by itself as well.
This is exactly what I meant. If an init script is taking too long (10 minutes?) it must be killed. Imagine you have a remote server and the boot is stopped because the ldap init script is blocked. You would need to go on site to fix it, or find someone expert enough to do it for you.
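A services manager that enforces such a limit could look roughly like the following Python sketch. It is not how SysV init works; run_with_timeout and run_boot_sequence are hypothetical helpers, and the 600-second default simply mirrors the "10 minutes?" guess above.

```python
import subprocess
import sys


def run_with_timeout(command, limit_seconds):
    """Run one init step; kill it if it exceeds limit_seconds.

    Returns the exit status, or None if the step had to be killed.
    """
    try:
        completed = subprocess.run(command, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before re-raising.
        return None
    return completed.returncode


def run_boot_sequence(steps, limit_seconds=600):
    """Run (name, command) steps in order; a hung step is killed so later ones still run."""
    killed = []
    for name, command in steps:
        status = run_with_timeout(command, limit_seconds)
        if status is None:
            print(f"{name}: killed after {limit_seconds}s", file=sys.stderr)
            killed.append(name)
    return killed
```

Two caveats apply. Only the direct child is killed, so a script that has already spawned a daemon leaves it running, and a process stuck in uninterruptible I/O cannot be killed this way at all. Whether it is safe to continue past a killed step is a separate policy question: for something like iptables or auditd, carrying on may be worse than stopping.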
2006-04-18 Federico simon3z@yahoo.com wrote
This is exactly what I meant. If an init script is taking too long (10 minutes?) it must be killed. Imagine you have a remote server and the boot is stopped because the ldap init script is blocked. You would need to go on site to fix it, or find someone expert enough to do it for you.
This has bit me _many_ times (but it has been the shutdown process, not the boot process that hangs). Normally from a dead NFS-server, causing some other process to wait for some I/O, which again blocks the shutdown. It is still part of "init", but just thought I'd mention it so the same timeouts are considered for shutdowns as well.
Rgds.
Ola Thoresen
On Sat, Apr 15, 2006 at 02:14:36AM -0700, Federico wrote:
Yesterday my server had a power failure and the LDAP database got corrupted. The weird thing is that the corruption caused the ldap init script to hang, which stopped the server from booting. Today I corrupted the database on purpose to reproduce the problem, but it no longer happens. Besides all the precautions (UPS, RAID, journaled filesystems, etc.), and besides the fact that it was probably a very particular kind of database corruption, shouldn't the init scripts have a timeout? If any of the scripts hangs, the whole boot process stops. Isn't this really dangerous behaviour?
I personally think using Berkeley DB is dangerous behavior, but what can you do?
Fortunately db_recover seems to be taking care of these issues pretty reliably. I wonder if a quick call to it at the beginning of the ldap init script might be a good idea...
Steve
Once upon a time, Steven Pritchard steve@silug.org said:
I personally think using Berkeley DB is dangerous behavior, but what can you do?
Yeah, that's why I'm planning to skip /var/db/*.db, put everything in /etc/{passwd,shadow}, and let nscd cache it all for me.
Federico <simon3z <at> yahoo.com> writes:
Yesterday my server had a power failure and the LDAP database got corrupted. The weird thing is that the corruption caused the ldap init script to hang, which stopped the server from booting. Today I corrupted the database on purpose to reproduce the problem, but it no longer happens. Besides all the precautions (UPS, RAID, journaled filesystems, etc.), and besides the fact that it was probably a very particular kind of database corruption, shouldn't the init scripts have a timeout? If any of the scripts hangs, the whole boot process stops. Isn't this really dangerous behaviour?
Bumped into this one myself. Really annoying. You probably have BDB as your LDAP back end, just like me. And BDB is scary when it comes to uncommitted transactions or any other type of problem you may have in your database. Although it is probably not entirely correct, in order to avoid this hang I have this in /etc/sysconfig/ldap:
su - -s /bin/bash ldap -c /usr/sbin/slapd_db_recover
If the database is OK, nothing happens. If it is screwed, the recover process gets it to the state where slapd actually starts. Usually, it even works properly :-)
Word of warning: I do dump my LDAP database regularly, so I know that even if db_recover stuffs up my database, I can go back. Unless you're doing something similar, I suggest you read more about what db_recover does before implementing this workaround.
-- Bojan
--- Bojan Smojver bojan@rexursive.com wrote:
Bumped into this one myself. Really annoying. You probably have BDB as your LDAP back end, just like me. And BDB is scary when it comes to uncommitted transactions or any other type of problem you may have in your database. Although it is probably not entirely correct, in order to avoid this hang I have this in /etc/sysconfig/ldap:
su - -s /bin/bash ldap -c /usr/sbin/slapd_db_recover
If the database is OK, nothing happens. If it is screwed, the recover process gets it to the state where slapd actually starts. Usually, it even works properly :-)
It looks like a great solution to me. We might just need to get a notification (email to root?) if the database was corrupted. I hope it can be added in future LDAP releases. Anyway, I still think that init scripts should have a timeout managed by the SysV system.
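The recover-then-start workaround, combined with the notification idea, might be sketched in Python as follows. This is an assumption-laden illustration, not what the ldap init script does: recover_then_start and notify_stderr are hypothetical names, the commands are passed in by the caller, and the notification goes to stderr here rather than to root's mailbox.

```python
import subprocess
import sys


def notify_stderr(message):
    """Default notifier; a real setup might mail root instead (assumption)."""
    print(message, file=sys.stderr)


def recover_then_start(recover_cmd, start_cmd, notify=notify_stderr):
    """Run the recovery tool, then start the service only if recovery succeeded."""
    recovery = subprocess.run(recover_cmd, capture_output=True, text=True)
    if recovery.returncode != 0:
        # Do not start the service on top of a database that could not be recovered.
        notify(f"recovery failed ({recovery.returncode}): {recovery.stderr.strip()}")
        return False
    return subprocess.run(start_cmd).returncode == 0
```

The sketch only reports a recovery that failed outright. Telling the administrator that a recovery was needed and succeeded would take extra information that is not modelled here, and the earlier warning about keeping regular dumps still applies.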
devel@lists.stg.fedoraproject.org