On Thu, Dec 11, 2008 at 08:59:40AM +0000, Russell King wrote:
> > Hacky patch that mlock()s rpmdb's environment mmap(2)s, in order to
> > attempt to avoid spurious rpmdb corruption issues on Linux that seem
> > to be somehow related to pagein/pageout occurring.
The relevant questions are:
1. which kernel version is this occurring with?
2. what device is the swap on?
3. which drivers are being used?
This issue goes back to May 2007 or so, when I noticed db4 corruption
when using rpm. I started digging into it, and ran into an issue with
fsx-linux, which you reported to linux-arch@ here:
Unfortunately, the issue seen with fsx-linux turned out to be unrelated
to the rpm db4 corruption issue.
I applied the hacky rpm db4 database mlock() patch (which was never
meant to go upstream!) to see if that would make it go away, and it
seems to have made it go away, since I haven't managed to reproduce
it since and haven't had any reports about it since.
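For anyone who hasn't seen the approach, here is a minimal sketch of what
"mlock()ing a shared writable mmap" looks like. This is not the actual rpm
patch; the helper name map_and_lock() and its arguments are made up for
illustration. It maps a file MAP_SHARED and then asks the kernel to pin
the pages in RAM so they are never paged out. An mlock() failure (for
example because RLIMIT_MEMLOCK is too low) is treated as non-fatal here.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the rpm patch): map `len` bytes of
 * `path` as a shared, writable mapping and try to pin it in RAM.
 * Returns the mapping, or NULL on failure. *locked is set to 1 if
 * mlock() succeeded and to 0 if it did not.
 */
static void *map_and_lock(const char *path, size_t len, int *locked)
{
    struct stat st;
    void *p;
    int fd;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    /* Grow the file if needed so it backs the whole mapping; never shrink it. */
    if (fstat(fd, &st) < 0 ||
        ((size_t)st.st_size < len && ftruncate(fd, (off_t)len) < 0)) {
        perror("fstat/ftruncate");
        close(fd);
        return NULL;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    /* Pin the pages; on failure we keep the (unlocked) mapping anyway. */
    *locked = (mlock(p, len) == 0);
    if (!*locked)
        perror("mlock");

    return p;
}
```

The caller releases the region with munmap(), which also drops the lock.
Note that locking only keeps the pages resident; dirty data is still
written back to the file by the kernel as usual.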
Without the mlock patch, the corruption would happen even in
qemu-system-arm, an environment in which cache aliasing effects don't
exist, so I abandoned the theory of it being a cache aliasing issue at
the time and theorised that somehow a dirty page was having its dirty
data discarded and an older stale copy being swapped back in, although
I've never been able to prove this -- after spending a week
unsuccessfully trying to hunt it down at the time I haven't spent any
more time on it since. (And everyone I mentioned this to seemed to
agree that shared writeable mmap() is icky and yuck and booh and "hard
to get right", and that didn't increase my motivation to look into it.)
I don't even know if it's an issue anymore in recent kernels. I don't
even know if it's (assuming that it _is_ indeed a kernel issue) an
arch/arm issue or a kernel-wide issue that simply occurs more often on
ARM because ARM systems generally have less memory and therefore
generally have more memory pressure. (There are certainly enough reports
of rpm database corruption on x86 as well, but in almost every report
there are more factors involved, such as people Ctrl-C'ing and killing
rpm processes as they are manipulating the database, etc.)