Hi, folks
I'm hoping to use 389 DS to replace our ancient Sun DS 5.2 service.
I've hit a snag with my 389 development server; its performance is far worse than the 10-year-old servers it's intended to replace.
Things looked promising: the old directory data has been imported (with only minor changes), read requests perform reasonably well, and isolated write requests are ok.
However, after even a small number (typically 6) of consecutive write requests (basic attribute changes to a single entry, say), the ns-slapd process hits >100% CPU (of 2 CPUs) and stays there for *at least* 10 seconds per update, blocking the client process attempting the update.
I can't see anything obvious in the performance counters or the logs to suggest a problem. The updates are logged with "etime=0" in the access log.
I've tried enabling different log levels in the error log. Is it normal for the Plugin level to show constant re-scanning of CoS templates?
I'd be very grateful for any suggestions on how I can go about tracing where the problem might be and how to resolve it...
Best wishes, Steve
Details
The RHEL 6.5 server is a VMware ESXi VM with 8GB RAM and 2 CPUs, and is running the latest EPEL package for RHEL6 (v1.2.11.15-32). (After a package upgrade a few weeks ago, I ran "setup-ds-admin.pl -u".)
The directory contains in excess of 200,000 entries, and its databases consume over 3.5GB on disk.
The userRoot database has therefore been configured with a 4GB cache (and the general LDBM max cache is set at 6GB), though it's quite possible I haven't understood how to set these correctly - I've tried smaller values for each.
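For what it's worth, I set those caches via the standard LDBM config entries - roughly the following (a sketch: the DNs are the defaults for my instance, and it's possible these aren't the right two attributes):

```ldif
# Entry cache for the userRoot backend (4GB, value in bytes)
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-cachememsize
nsslapd-cachememsize: 4294967296

# Global LDBM database cache (6GB, value in bytes)
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-dbcachesize
nsslapd-dbcachesize: 6442450944
```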
The directory contains custom attributes, some of which are CoS-generated, and many of which have been indexed (AFAIK, all attributes have been re-indexed).
No replication has been configured so far.
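(As an aside on the indexing: one sanity check I know of is to count the notes=U flags that unindexed searches leave in the access log - the path below assumes the default layout:)

```shell
# Count access-log RESULT lines flagged notes=U (unindexed search).
# Takes the access log path as its argument.
count_unindexed() {
    grep -c 'notes=U' "$1"
}

# e.g.: count_unindexed /var/log/dirsrv/slapd-*/access
```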
___________________________________________________________ This email has been scanned by MessageLabs' Email Security System on behalf of the University of Brighton. For more information see http://www.brighton.ac.uk/is/spam/ ___________________________________________________________
When you did your import, did you make sure that your indexes got rebuilt?
On 03/31/2014 08:34 AM, Steve Holden wrote:
I've hit a snag with my 389 development server; its performance is far worse than the 10-year-old servers it's intended to replace.
...
Hi, Dustin
Thanks for the rapid response.
I did - sorry for not making that clearer (it was buried in the footer).
I added indexes for the equivalent attributes from our current servers, and then re-indexed all attributes by unchecking and re-checking one of the checked boxes for _every_ indexed attribute and clicking "Save".
Since then, I've used the following, which I understand re-indexes every attribute by default (but let me know if I've misunderstood!):
/usr/lib64/dirsrv/slapd-${HOSTNAME%%.*}/db2index.pl \
    -D "$ADMIN_USER" -w "$ADMIN_PASSWD" -n userRoot -v
Best wishes, Steve
-----Original Message-----
From: 389-users-bounces@lists.fedoraproject.org [mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Dustin Rice
Sent: 31 March 2014 16:37
To: 389-users@lists.fedoraproject.org
Subject: Re: [389-users] Serious write-performance problems on RHEL6
When you did your import, did you make sure that your indexes got rebuilt?
On 03/31/2014 08:34 AM, Steve Holden wrote:
I've hit a snag with my 389 development server; it's performance far worse than the 10 year-old servers it's intended to replace.
...
The directory contains custom attributes, some of which are CoS, and many of which have been indexed (AFAIK, all attributes have been re-indexed).
Ah derp, my mistake, I apparently didn't scroll down far enough.
Do you know if the slapd procs are sitting in an IO wait state?
When you run a ps -eLf how many slapd threads are there?
I'm going through a migration from Sun One 5.2 to 389 DS also. You can set up a 389 DS replica of a Sun One master; it works out pretty well. My plan is to slowly replace our Sun One replicas with 389 DS replicas tied to a 389 DS hub. Once all the Sun One replicas have been replaced, I promote that 389 DS hub to master and we're done.
On 03/31/2014 09:28 AM, Steve Holden wrote:
I did - sorry for not making that clearer (it was buried in the footer).
...
No worries.
I've checked for IO issues using iotop, but ns-slapd only makes fleeting appearances there, so I don't *think* it's IO-bound. There are 43 separate slapd threads (according to 'ps -efL | grep -c slapd').
I'd wondered about setting the 389 server as a replica of the production servers, but am not feeling that brave yet!
-----Original Message-----
From: 389-users-bounces@lists.fedoraproject.org [mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Dustin Rice
Sent: 31 March 2014 17:59
To: 389-users@lists.fedoraproject.org
Subject: Re: [389-users] Serious write-performance problems on RHEL6
Ah derp, my mistake, I apparently didn't scroll down far enough.
Do you know if the slapd procs are sitting in an IO wait state?
When you run a ps -eLf how many slapd threads are there?
...
In the phase with high CPU usage, could you run:
a) top -H -p <pid>, to see whether many threads are competing for the CPU or just one or two are occupying it;
b) pstack <pid>, to see what the threads are doing. Sometimes pstack for the complete process doesn't look meaningful; you can also run pstack <tid>, where <tid> is one of the threads consuming the CPU.
You are on a VM with 2 CPUs - what is the real hardware? There have been problems with RHDS on machines with NUMA architecture when the threads of the process were distributed to different nodes. What was the hardware for the Sun DS?
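If it's easier to script, a rough ps equivalent of a) (an illustrative sketch; the pid-file path in the usage comment is just the default layout):

```shell
# Per-thread CPU/state snapshot for one process - a scriptable
# stand-in for an interactive `top -H -p <pid>`.
thread_snapshot() {
    # one line per thread: thread id, %CPU, state, command
    ps -L -p "$1" -o tid=,pcpu=,state=,comm=
}

# Against the directory server (default pid-file location assumed):
#   PID=$(cat /var/run/dirsrv/slapd-*.pid)
#   thread_snapshot "$PID" | sort -k2 -rn | head   # hottest threads first
#   pstack "$PID"            # then pstack <tid> for the hot thread
```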
Ludwig
On 03/31/2014 05:34 PM, Steve Holden wrote:
However, after even a small number (typically 6) of consecutive write requests (basic attribute changes to a single entry, say), the ns-slapd process hits >100% CPU (of 2 CPUs) and stays there for *at least* 10 seconds per update, blocking the client process attempting the update.
...
Hi, Ludwig
Thanks for taking the time to reply.
I'm using this simple LDIF import file to reproduce the problem: http://pastebin.com/NyNY650L. It generally hangs around the 7th record, and the complete import takes 2m32s!
Parent pid from /var/run/dirsrv/slapd-algieba.pid is 10382, and 'top -H' for it shows:
# top -H -p 10382
top - 11:45:02 up 5 days, 22:27,  9 users,  load average: 0.87, 0.35, 0.24
Tasks:  41 total,   1 running,  40 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.8%us,  0.2%sy,  0.0%ni, 97.6%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8060964k total,  7886612k used,   174352k free,   215412k buffers
Swap:  6291448k total,    11584k used,  6279864k free,  2588264k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10892 dirsrv   20   0 11.9g 6.1g 1.9g R 99.7 79.1  46:19.78 ns-slapd
10881 dirsrv   20   0 11.9g 6.1g 1.9g S  2.0 79.1  40:58.88 ns-slapd
...
Standard top output:

top - 11:29:09 up 5 days, 22:11,  9 users,  load average: 0.35, 0.14, 0.16
Tasks: 219 total,   1 running, 218 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.8%us,  0.2%sy,  0.0%ni, 97.6%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8060964k total,  7887052k used,   173912k free,   215412k buffers
Swap:  6291448k total,    11584k used,  6279864k free,  2587968k cached

  PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
10382 dirsrv   20   0 11.9g 6.1g 1.9g S 100.3 79.1  83:23.85 ns-slapd
25216 root     20   0 15168 1184  824 R   2.0  0.0   0:00.01 top
Output from pstack is here: http://pastebin.com/8LpfbdCb
I'm curious about the number of CoS lines, which I've highlighted. I mention this because enabling the Plugins level in the error log shows an incredible amount of CoS activity - and the LDIF import above doesn't include any CoS attributes. I'll disable the CoS rules and see whether that helps...
Thanks for the note about hardware. It sounds like it doesn't apply here; details are: VM host hardware: Dell PowerEdge R710, 2x quad-core Xeon. Production hardware: Sun Fire V240 (SPARC, 8GB RAM).
Best wishes, Steve
-----Original Message-----
From: 389-users-bounces@lists.fedoraproject.org [mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Ludwig Krispenz
Sent: 01 April 2014 08:33
To: 389-users@lists.fedoraproject.org
Subject: Re: [389-users] Serious write-performance problems on RHEL6
...
Hi,
Looks like the CoS plugin is busy with internal searches; maybe these are unindexed. Could you turn on logging for internal searches in the CoS plugin to see what kind of searches are performed: http://directory.fedoraproject.org/wiki/Plugin_Logging
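If it helps, the settings on that page amount to roughly the following (a sketch from memory - please check the wiki for your version): enable plugin logging, and add the "internal operations" bit (4) to the default access log level (256), giving 260:

```ldif
dn: cn=config
changetype: modify
replace: nsslapd-plugin-logging
nsslapd-plugin-logging: on
-
replace: nsslapd-accesslog-level
nsslapd-accesslog-level: 260
```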
Ludwig
On 04/01/2014 12:52 PM, Steve Holden wrote:
Output from pstack is here: http://pastebin.com/8LpfbdCb
I'm curious about the number of CoS lines, which I've highlighted. I mention this because enabling the Plugins level in the error log shows an incredible amount of CoS activity - and the LDIF import above doesn't include any CoS attributes.
...