MongoDB and SSH Unreachable While Ping Works - OOM Postmortem¶
Status: ✅ Completed
Created: 2026-02-10
Last updated: 2026-02-27 (doubled MongoDB memory limits)
Incident Summary¶
- Symptom: for a period of time, MongoDB and SSH were unreachable, but `ping` to the server still worked.
- Environment: single host running multiple heavy services (`mongod`, `elasticsearch`, `yugabyte`, `k3s`, `kibana`).
- Result: service access recovered later; post-incident forensics were then performed.
Key Findings¶
1) Root cause category¶
- Root cause was host-level memory exhaustion (global OOM), not a network link failure.
- `ping` can still succeed during OOM because ICMP echo is answered by the kernel and is lightweight; it does not prove TCP service health.
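This distinction can be checked directly: ICMP reachability and TCP service health are independent tests. A minimal sketch (the host and port are placeholders):

```shell
#!/bin/sh
# ICMP reachability vs TCP service health - two independent checks.
# host/port are placeholders for illustration.
host=127.0.0.1
port=22
ping -c 1 -W 1 "$host" >/dev/null 2>&1 \
  && echo "ICMP: reachable" || echo "ICMP: unreachable"
# Bash's /dev/tcp gives a dependency-free TCP connect test:
(timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port") >/dev/null 2>&1 \
  && echo "TCP $port: accepting connections" || echo "TCP $port: not accepting"
```

During the incident window, the first check would have passed while the second failed.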
2) Hard evidence from logs¶
- Kernel logs showed repeated OOM events: `global_oom`, `Out of memory: Killed process ...`
- Victims included `coredns` and `postgres`.
- OOM snapshots showed very high memory usage by `mongod`.
- SSH logs included `Broken pipe [preauth]` around the same period, matching instability under memory pressure.
3) Why mongod looked larger than expected¶
- `wiredTiger.engineConfig.cacheSizeGB=8` was confirmed active.
- `serverStatus().wiredTiger.cache["maximum bytes configured"]` confirmed 8 GiB.
- However, `mongod` RSS can exceed the WiredTiger cache due to:
  - non-cache allocations (connections, execution memory, index operations)
  - allocator behavior (tcmalloc retained memory, fragmentation)
  - additional internal memory overhead.
4) Clarification about “many mongod processes”¶
- The long list in `htop` was a thread view, not many independent MongoDB instances.
- Linux tools can display one process as many thread IDs when thread display is enabled.
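The thread-vs-process distinction is easy to reproduce with `ps`; a quick sketch (the `mongod` lookup line is illustrative and left commented out):

```shell
# ps -T prints one row per thread (SPID column) of a single PID.
# Demonstrated here with the current shell; for mongod it would be:
#   ps -T -p "$(pgrep -x mongod)"
ps -T -p $$
# In htop, pressing H toggles hiding userland process threads.
```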
Actions Taken¶
1) Enforced memory guardrails¶
- Added cgroup memory controls for the MongoDB service: `MemoryHigh=16G`, `MemoryMax=18G`
- Added cgroup memory controls for the Elasticsearch service: `MemoryHigh=7G`, `MemoryMax=8G`
1.1) Exact implementation (critical fix)¶
mongod unit file (/usr/lib/systemd/system/mongod.service) includes:
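The exact `[Service]` lines were not preserved in this document; a reconstruction from the limits recorded above:

```ini
# /usr/lib/systemd/system/mongod.service (excerpt, reconstructed from the
# values noted in this postmortem - the full unit also contains the usual
# Exec/User/Group settings)
[Service]
MemoryHigh=16G
MemoryMax=18G
```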
elasticsearch service configuration (initial attempt):
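A sketch of that initial attempt, using the values recorded above (these later proved too small):

```ini
# elasticsearch service (excerpt) - initial attempt
[Service]
MemoryHigh=7G
MemoryMax=8G
```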
Important discovery: the initial 8G limit was insufficient for Elasticsearch, which needed 6G of heap plus ~7.5G of direct memory.
Final working configuration (using systemd override):
Add to the override file:
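The override content itself was not preserved here; based on the effective values verified later (14G/16G), it would look like:

```ini
# /etc/systemd/system/elasticsearch.service.d/override.conf
# (path assumed from the systemd drop-in convention; create it with
#  `sudo systemctl edit elasticsearch`)
[Service]
MemoryHigh=14G
MemoryMax=16G
```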
Apply and verify steps:

```shell
sudo systemctl daemon-reload
sudo systemctl restart mongod elasticsearch
systemctl show mongod -p MemoryHigh -p MemoryMax
systemctl show elasticsearch -p MemoryHigh -p MemoryMax
```
Observed effective values:
- `mongod`: `MemoryHigh=17179869184` (16G), `MemoryMax=19327352832` (18G)
- `elasticsearch`: `MemoryHigh=15032385536` (14G), `MemoryMax=17179869184` (16G)
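The raw byte values can be sanity-checked against the human-readable sizes with a one-liner (all three divide evenly by 1 GiB):

```shell
# Convert systemd's reported byte values back to GiB:
for b in 17179869184 19327352832 15032385536; do
  awk -v b="$b" 'BEGIN { printf "%d bytes = %dG\n", b, b / (1024^3) }'
done
# → 17179869184 bytes = 16G
# → 19327352832 bytes = 18G
# → 15032385536 bytes = 14G
```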
Notes:
- These limits are hard protections at the service cgroup level.
- This is separate from MongoDB `wiredTiger.cacheSizeGB=8` (cache-only limit).
- `MemoryMax` prevents one service from exhausting host memory and triggering global OOM again.
- Elasticsearch requires a higher limit due to JVM heap + direct memory + overhead.
1.2) Elasticsearch JVM configuration details¶
Problem encountered during restart:
When attempting to restart Elasticsearch under the initial 8G systemd limit, the service was killed by the OOM killer despite the heap being capped at 6G:
```shell
# Evidence from logs
journalctl -u elasticsearch
# Output showed:
# elasticsearch.service: A process of this unit has been killed by the OOM killer.
```
Root cause: Elasticsearch memory usage = JVM heap + direct memory + overhead
```shell
# Check heap configuration
grep "^-Xms\|^-Xmx" /etc/elasticsearch/jvm.options /etc/elasticsearch/jvm.options.d/* 2>/dev/null
# Output:
# /etc/elasticsearch/jvm.options.d/heap.options:-Xms6g
# /etc/elasticsearch/jvm.options.d/heap.options:-Xmx6g

# Check process command line (shows actual memory parameters)
ps aux | grep elasticsearch | grep java
# Revealed: -XX:MaxDirectMemorySize=8053063680 (≈7.5G)
```
Memory calculation:

- Heap: 6G
- Direct memory: ~7.5G
- Overhead (threads, native memory, etc.): ~1-2G
- Total: ~14-15G
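The direct-memory figure comes straight from the observed flag; converting it and summing the budget confirms the arithmetic:

```shell
# MaxDirectMemorySize from the process command line, in bytes:
awk -v b=8053063680 'BEGIN { printf "direct memory = %.1f GiB\n", b / (1024^3) }'
# → direct memory = 7.5 GiB
# Heap (6) + direct (7.5) + overhead (~1.5) ≈ 15 GiB, hence the 16G MemoryMax.
awk 'BEGIN { printf "budget ≈ %.1f GiB\n", 6 + 7.5 + 1.5 }'
# → budget ≈ 15.0 GiB
```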
JVM configuration files (/etc/elasticsearch/jvm.options.d/):
Elasticsearch uses a modular JVM configuration approach: settings can be organized in separate files under the jvm.options.d/ directory instead of editing the main jvm.options file:
```shell
# Heap configuration
cat /etc/elasticsearch/jvm.options.d/heap.options
# -Xms6g
# -Xmx6g

# Optional: reduce direct memory if needed
# (default is calculated automatically, can be ~50% of heap)
echo "-XX:MaxDirectMemorySize=3g" > /etc/elasticsearch/jvm.options.d/memory.options
```
Advantages of jvm.options.d/ approach:
- Package updates don't overwrite custom settings
- Clear separation of concerns (heap, memory, GC settings, etc.)
- Easier to manage and version control
Solution: Raised systemd limit to 16G to accommodate actual memory needs.
Alternative approach (if memory is constrained):

- Reduce heap to 4G: `-Xms4g -Xmx4g`
- Limit direct memory: `-XX:MaxDirectMemorySize=2g`
- Total would be ~7-8G (fits in the original 8G limit)
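A sketch of that constrained alternative, written to a temp directory here for safety (on a real host the target is /etc/elasticsearch/jvm.options.d/ and a restart is required):

```shell
dir=$(mktemp -d)
# Smaller heap:
printf '%s\n' '-Xms4g' '-Xmx4g' > "$dir/heap.options"
# Explicit direct-memory cap (otherwise it is derived from the heap):
echo '-XX:MaxDirectMemorySize=2g' > "$dir/memory.options"
cat "$dir"/*.options
# → -Xms4g
# → -Xmx4g
# → -XX:MaxDirectMemorySize=2g
```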
2) Verified effect¶
- `systemctl show` confirmed the new memory limits are loaded.
- Post-change process view showed a major reduction in MongoDB memory usage.
- Overall host usage dropped to around `12G / 62G`, leaving safe headroom.
3) MongoDB memory limit doubled (2026-02-27)¶
MongoDB's memory allowance proved insufficient, so the systemd cgroup limits were doubled. Since /etc/systemd/system/mongod.service.d/override.conf does not exist, the unit file was modified directly:
```shell
sudo sed -i 's/MemoryMax=18G/MemoryMax=36G/; s/MemoryHigh=16G/MemoryHigh=32G/' /usr/lib/systemd/system/mongod.service
sudo systemctl daemon-reload
sudo systemctl restart mongod
```
Verified effective values after change:
- `mongod`: `MemoryHigh=34359738368` (32G), `MemoryMax=38654705664` (36G)
Note: `wiredTiger.cacheSizeGB` remains at 8G (a cache-only limit, separate from the cgroup limits).
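For reference, that cache cap lives in the MongoDB config file, not the unit file; a sketch of the relevant stanza, assuming the default /etc/mongod.conf path:

```yaml
# /etc/mongod.conf (excerpt) - bounds only the WiredTiger cache,
# not total mongod RSS; the cgroup MemoryHigh/MemoryMax bound the rest.
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8
```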
Final Conclusion¶
- Primary fault pattern: resource contention and host OOM, not pure connectivity failure.
- Most likely pressure source during incident window: high memory usage from major data services, with MongoDB as a significant contributor.
- Current mitigations substantially reduce recurrence risk.
One-line Summary¶
- The cause was a host-level global OOM that left MongoDB/SSH briefly unreachable even while `ping` still succeeded; the fix was to configure and verify hard systemd `MemoryHigh`/`MemoryMax` memory limits for `mongod` and `elasticsearch` so that no single service can exhaust host memory again.
Recommended Ongoing Checks¶
- Daily/regular checks:
  - `free -h`
  - `dmesg -T | egrep -i "oom|out of memory|killed process"`
- If OOM reappears, expand to:
- workload-level memory profiling for MongoDB, Elasticsearch, and Yugabyte
- stricter per-service limits or host/service separation.
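The daily checks above can be combined into one cron-able script; a minimal sketch (the grep pattern mirrors the commands listed above):

```shell
#!/bin/sh
# Daily OOM check: report memory headroom and scan the kernel log.
# dmesg may require root; a silent/empty log counts as OK here.
free -h
if dmesg -T 2>/dev/null | grep -Eiq "oom|out of memory|killed process"; then
  echo "WARNING: OOM events present in kernel log"
else
  echo "OK: no OOM events in kernel log"
fi
```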
Commands Used During Investigation (Reference)¶
Initial OOM investigation¶
```shell
systemctl list-units --type=service | egrep -i "ssh|mongo|containerd|k3s|kube"
sudo grep -Ei "oom|killed process|out of memory|sshd|mongo|mongod" /var/log/messages /var/log/secure
dmesg -T | egrep -i "oom|killed process|out of memory|I/O error|blocked for more than"
mongosh --quiet --eval 'db.serverCmdLineOpts().parsed.storage.wiredTiger.engineConfig.cacheSizeGB'
mongosh --quiet --eval 's=db.serverStatus().wiredTiger.cache; printjson({max:s["maximum bytes configured"], now:s["bytes currently in the cache"]})'
cat /proc/<mongod_pid>/cmdline | tr '\0' ' '
systemctl show mongod -p MemoryHigh -p MemoryMax
systemctl show elasticsearch -p MemoryHigh -p MemoryMax
ps -eo pid,comm,rss,%mem --sort=-rss | head -n 15
```
Elasticsearch restart hang troubleshooting¶
```shell
# Check service status when stuck
systemctl status elasticsearch --no-pager

# Check processes
ps aux | grep elasticsearch
pgrep -a java | grep elasticsearch

# Monitor real-time logs (most critical)
sudo journalctl -u elasticsearch -f --since "5 minutes ago"

# Check Elasticsearch logs
sudo tail -f /var/log/elasticsearch/elasticsearch.log

# Check JVM configuration
grep "^-Xms\|^-Xmx" /etc/elasticsearch/jvm.options /etc/elasticsearch/jvm.options.d/* 2>/dev/null

# Check systemd memory limits
systemctl show elasticsearch -p TimeoutStartSec -p MemoryHigh -p MemoryMax

# Check memory availability
free -h

# Check for recent OOM events
dmesg -T | egrep -i "oom.*elasticsearch" | tail -n 20

# Safe restart procedure
sudo systemctl stop elasticsearch
sudo pkill -9 -f elasticsearch  # if stop hangs
sudo systemctl start elasticsearch
sudo journalctl -u elasticsearch -f
```