
Splunk uLimits and You

Most folks are familiar with the concept of file descriptors in Unix/Linux. File descriptor limits get a mention in the Splunk docs, both in the system requirements under the section “Considerations regarding file descriptor limits (FDs) on *nix systems” and in the troubleshooting material.
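For a quick sense of what we are dealing with, you can check your shell's own limits and count what splunkd actually has open. This is just a sketch, assuming Splunk is installed under /opt/splunk and is running; do it as root so you can read the /proc entry:

 # soft and hard open file limits for the current shell
 ulimit -Sn
 ulimit -Hn

 # count the file descriptors splunkd has open right now
 Q=`head -1 /opt/splunk/var/run/splunk/splunkd.pid`
 ls /proc/$Q/fd | wc -l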

I run a very high volume index cluster on a daily basis, complete with Splunk Enterprise Security. One thing I have seen is that if your timestamps are off, you can get a VERY LARGE number of buckets for a low overall raw data size. If you see nearly 10,000 buckets for only several hundred GB of data, then you have that problem. Keep in mind that is a lot of file descriptors potentially in use. Check your incoming logs and you will likely find some nasty multi-line log with a line breaking issue, where a large integer is getting parsed as an epoch time and creating buckets with timestamps way back in the past. A search like the one below is one way to spot those buckets.
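One way to hunt for them is dbinspect. Treat this as a sketch to run from the CLI; index=* and the five year cutoff are placeholders to adjust for your environment. It counts buckets whose earliest event sits more than five years in the past:

 /opt/splunk/bin/splunk search '| dbinspect index=* | where startEpoch < relative_time(now(), "-5y") | stats count by index'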

It got me thinking about the number of open files, though, especially when you add in all the buckets that data model accelerations have to build to support the Enterprise Security application. Maybe FD limits had been interfering with my data model acceleration bucket builds.

Then we had a couple of indexers spontaneously crash their splunkd processes, with an error indicating file descriptor limit problems.
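If you suspect the same thing, the OS error string for hitting the descriptor limit is "Too many open files", so it is worth grepping splunkd.log for it. A quick check, assuming the default install path:

 grep -i "too many open files" /opt/splunk/var/log/splunk/splunkd.log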

I discussed it with my main Splunk partner in crime, Duane Waddle. He explained that if a process starts on its own, without a user session, Linux might not honor the ulimits from limits.conf. So even though we had done the right things accounting for ulimits, Transparent Huge Pages, and so on, we were still likely getting hosed.
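For context, limits.conf is applied by the pam_limits PAM module during login, so a daemon launched at boot never passes through it. On RHEL-family systems that module gets pulled in by a session line along these lines in /etc/pam.d/system-auth (shown only for reference, nothing to change here):

 session     required      pam_limits.so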

Take this example from /etc/security/limits.conf, using a section like the one below for a high volume indexer in a cluster:

 #splunk
 root        soft    nofile           32768
 root        hard    nofile           65536
 splunk      soft    nofile           32768
 splunk      hard    nofile           65536

You might be getting the 4096 default if Splunk is kicking off via the enable boot-start option.
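By contrast, an interactive login does go through PAM and picks up the limits.conf values. A quick sanity check of that half, assuming the splunk user exists and has a login shell:

 su - splunk -c 'ulimit -Sn; ulimit -Hn'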

You can test this by logging into your server and running the following:

sudo -i
# grab splunkd's PID, show the process, then dump its limits
Q=`head -1 /opt/splunk/var/run/splunk/splunkd.pid` && ps -fp $Q && cat /proc/$Q/limits

Check the results, looking for the Max open files line.

UID         PID   PPID  C STIME TTY          TIME CMD
root      82110      1  5 02:50 ?        00:00:01 splunkd -p 8089 restart
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             3815                 3815                 processes
Max open files            4096                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       3815                 3815                 signals
Max msgqueue size         819200               819200               bytes
Max nice priority         20                   20
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Duane suggested editing the Splunk init file. My coworker Matt Uebel ran with that and came up with the following quick commands to make the edit. Use them, substituting your desired limit values:

sed -i '/init.d\/functions/a ulimit -Sn 32768' /etc/init.d/splunk
sed -i '/init.d\/functions/a ulimit -Hn 65536' /etc/init.d/splunk
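Because each sed appends right after the matched line, the hard limit line lands above the soft limit line in the finished script, which is the order you want since the soft limit cannot be raised above the current hard limit. You can eyeball the result with something like:

 grep -A 2 'init.d/functions' /etc/init.d/splunk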

Now when your system fully reboots and Splunk starts via enable boot-start without a user session, you should still get the desired ulimit values.
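If you do not want to wait for a reboot to confirm the edit, you can bounce Splunk through the init script and re-run the /proc check from earlier, as root and assuming the default install path. This proves the script sets the limits; the true hands-off test is still the next full reboot:

 /etc/init.d/splunk restart
 # give splunkd a moment to come back up, then check the new limit
 Q=`head -1 /opt/splunk/var/run/splunk/splunkd.pid` && grep "Max open files" /proc/$Q/limits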