Most folks are familiar with the concept of file descriptors in Unix/Linux. They come up in the Splunk system requirements docs under the section “Considerations regarding file descriptor limits (FDs) on *nix systems,” and again in the troubleshooting docs.
I run a very high volume indexer cluster, complete with Splunk Enterprise Security, on a daily basis. One thing I have seen is that if your timestamps are off, you can get a VERY LARGE number of buckets for a small amount of raw data. If you are seeing nearly 10,000 buckets for only a few hundred GB of data, you have that problem, and keep in mind that is a lot of file descriptors potentially in use. Check your incoming logs and you will likely find some nasty multi-line log with a line-breaking issue, where a large integer is getting parsed as an epoch time and creating buckets dated way back in time.
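As a rough command-line check (a sketch only; the /opt/splunk path, the defaultdb example index, and GNU find/date are assumptions you may need to adjust for your install), you can count buckets and eyeball the oldest bucket timestamps for an index. Warm/cold bucket directories are named db_<newestTime>_<oldestTime>_<id> with the times in epoch seconds:
IDX=/opt/splunk/var/lib/splunk/defaultdb/db
# total warm/cold bucket count for this index
find "$IDX" -maxdepth 1 -type d -name 'db_*' | wc -l
# five oldest "earliest time" values -- dates decades in the past point at bad line breaking
find "$IDX" -maxdepth 1 -type d -name 'db_*' -printf '%f\n' | awk -F_ '{print $3}' | sort -n | head -5 | xargs -I{} date -d @{}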
It got me thinking about the number of open files, though, especially since I also have to worry about all the buckets touched while data model accelerations are built to support the Enterprise Security application. Maybe FD limits had been interfering with my data model acceleration bucket builds.
Then a couple of our indexers spontaneously crashed their splunkd processes, with an error indicating file descriptor limit problems.
I discussed it with my main Splunk partner in crime, Duane Waddle. He explained that if a process starts on its own, without a user session, Linux may not honor the ulimits from limits.conf. So even though we had done the right things accounting for ulimits, Transparent Huge Pages, and so on, we were still likely getting hosed.
For example, /etc/security/limits.conf on a high volume indexer in a cluster might contain a section like this:
#splunk
root soft nofile 32768
root hard nofile 65536
splunk soft nofile 32768
splunk hard nofile 65536
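To confirm those values actually apply in an interactive session (assuming pam_limits is in your distro's su PAM stack, which it is on most Linux distributions, and that the splunk user has a normal login shell), open a login shell as the splunk user and print the soft and hard nofile limits:
# run as root; prints the soft limit, then the hard limit
su - splunk -c 'ulimit -Sn; ulimit -Hn'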
If Splunk is kicking off at boot via the enable boot-start option, though, you might be getting the 4096 default instead.
You can test this by logging into your server and running the following:
sudo -i
Q=`head -1 /opt/splunk/var/run/splunk/splunkd.pid` && ps -fp $Q && cat /proc/$Q/limits
Check the results for the Max open files line.
UID PID PPID C STIME TTY TIME CMD
root 82110 1 5 02:50 ? 00:00:01 splunkd -p 8089 restart
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 3815 3815 processes
Max open files 4096 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 3815 3815 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 20 20
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
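While you are in there, it can be worth comparing how many file descriptors splunkd actually has open against that cap. With $Q still set from the command above, something like this works (a quick sketch, nothing more):
# number of file descriptors currently open by the splunkd main process
ls /proc/$Q/fd | wc -l
If that count is creeping toward the Max open files value, you are living on borrowed time.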
Duane suggested editing the Splunk init file. My coworker Matt Uebel ran with that and came up with the following quick commands to make the edit. Substitute your desired limit values:
sed -i '/init.d\/functions/a ulimit -Sn 32768' /etc/init.d/splunk
sed -i '/init.d\/functions/a ulimit -Hn 65536' /etc/init.d/splunk
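To sanity check the result (a sketch; the restart is best done in a maintenance window on a clustered indexer, and it does not fully reproduce a boot-time start, but it confirms the script sets the limits before launching splunkd), verify the ulimit lines landed right after the functions include, restart through the init script rather than the splunk CLI, and re-check the live limits:
# show the functions include plus the two ulimit lines appended after it
grep -n -A 2 'init.d/functions' /etc/init.d/splunk
/etc/init.d/splunk restart
Q=`head -1 /opt/splunk/var/run/splunk/splunkd.pid` && grep 'Max open files' /proc/$Q/limits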
Now, when your system fully reboots and Splunk starts via enable boot-start without a user session, you should still get the desired ulimit values.