MongoDB and SSH Unreachable While Ping Works - OOM Postmortem

Status: ✅ Completed

Created: 2026-02-10
Last updated: 2026-02-27 (doubled MongoDB memory limits)


Incident Summary

  • Symptom: for a period of time, MongoDB and SSH were unreachable, but ping to the server still worked.
  • Environment: single host running multiple heavy services (mongod, elasticsearch, yugabyte, k3s, kibana).
  • Result: service access eventually recovered; post-incident forensics were performed afterward.

Key Findings

1) Root cause category

  • Root cause was host-level memory exhaustion (global OOM), not a network link failure.
  • ping can still succeed during OOM because ICMP echo is answered by the kernel: it is lightweight and proves nothing about the health of user-space TCP services such as sshd or mongod.
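A quick way to verify this distinction during an incident is to probe the TCP ports directly instead of trusting ping; a minimal sketch (host and ports are placeholders for the affected server):

```shell
# Probe a TCP port with a timeout. ICMP reachability implies nothing here:
# this check fails if the daemon cannot accept connections, even while ping works.
tcp_check() {
  local host=$1 port=$2
  if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

tcp_check 127.0.0.1 22      # sshd
tcp_check 127.0.0.1 27017   # mongod
```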

2) Hard evidence from logs

  • Kernel logs showed repeated OOM events:
    • global_oom
    • Out of memory: Killed process ...
    • Victims included coredns and postgres.
  • OOM snapshots showed very high memory usage by mongod.
  • SSH logs included Broken pipe [preauth] around the same period, matching instability under memory pressure.

3) Why mongod looked larger than expected

  • wiredTiger.engineConfig.cacheSizeGB=8 was confirmed active.
  • serverStatus().wiredTiger.cache["maximum bytes configured"] confirmed 8 GiB.
  • However, mongod RSS can exceed WT cache due to:
    • non-cache allocations (connections, execution memory, index operations)
    • allocator behavior (tcmalloc retained memory, fragmentation)
    • additional internal memory overhead.
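The cache-versus-RSS gap can be measured directly from /proc; a sketch (assumes Linux and a running mongod; compare the printed RSS against the 8 GiB cache limit):

```shell
# Print a process's resident set size (RSS) in MiB, read from /proc.
# RSS above wiredTiger's 8 GiB cache limit is the non-cache overhead described above.
rss_mib() {
  awk '/^VmRSS:/ { printf "%d\n", $2 / 1024 }' "/proc/$1/status"
}

# Guarded lookup: only report if a mongod process actually exists.
if pid=$(pgrep -x mongod | head -n 1) && [ -n "$pid" ]; then
  echo "mongod RSS: $(rss_mib "$pid") MiB"
fi
```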

4) Clarification about “many mongod processes”

  • The long list in htop was thread view, not many independent MongoDB instances.
  • Linux can display one process with many thread IDs when thread display is enabled.
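This is easy to confirm numerically; a sketch comparing process count to thread (LWP) count for a named service (the process name is a parameter, mongod assumed here):

```shell
# Print "<process count> <thread count>" for a given process name.
# htop's thread view lists every LWP, so the second number is what it shows.
thread_vs_process() {
  procs=$(ps -C "$1" -o pid= | wc -l)
  threads=$(ps -C "$1" -L -o lwp= | wc -l)
  echo "$procs $threads"
}

thread_vs_process mongod   # e.g. "1 150": one process, many threads
```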

Actions Taken

1) Enforced memory guardrails

  • Added cgroup memory controls for MongoDB service:
    • MemoryHigh=16G
    • MemoryMax=18G
  • Added cgroup memory controls for Elasticsearch service:
    • MemoryHigh=7G
    • MemoryMax=8G

1.1) Exact implementation (critical fix)

mongod unit file (/usr/lib/systemd/system/mongod.service) includes:

[Service]
MemoryMax=18G
MemoryHigh=16G

elasticsearch service configuration (initial attempt):

[Service]
MemoryMax=8G
MemoryHigh=7G

Important discovery: Initial 8G limit was insufficient for Elasticsearch with 6G heap + ~7.5G direct memory requirement.

Final working configuration (using systemd override):

sudo systemctl edit elasticsearch

Add to the override file:

[Service]
MemoryHigh=14G
MemoryMax=16G

Apply and verify steps:

sudo systemctl daemon-reload
sudo systemctl restart mongod elasticsearch
systemctl show mongod -p MemoryHigh -p MemoryMax
systemctl show elasticsearch -p MemoryHigh -p MemoryMax

Observed effective values:

  • mongod: MemoryHigh=17179869184 (16G), MemoryMax=19327352832 (18G)
  • elasticsearch: MemoryHigh=15032385536 (14G), MemoryMax=17179869184 (16G)

Notes:

  • MemoryHigh is a soft threshold (the kernel throttles and reclaims aggressively above it); MemoryMax is the hard cap. Both are enforced at the service cgroup level.
  • This is separate from MongoDB wiredTiger.cacheSizeGB=8 (cache-only limit).
  • MemoryMax prevents one service from exhausting host memory and triggering global OOM again.
  • Elasticsearch requires a higher limit due to JVM heap + direct memory + overhead.

1.2) Elasticsearch JVM configuration details

Problem encountered during restart:

When attempting to restart Elasticsearch with the initial 8G systemd limit, the service was killed by OOM killer despite heap being configured at 6G:

# Evidence from logs
journalctl -u elasticsearch
# Output showed:
# elasticsearch.service: A process of this unit has been killed by the OOM killer.

Root cause: Elasticsearch memory usage = JVM heap + direct memory + overhead

# Check heap configuration
grep "^-Xms\|^-Xmx" /etc/elasticsearch/jvm.options /etc/elasticsearch/jvm.options.d/* 2>/dev/null
# Output:
# /etc/elasticsearch/jvm.options.d/heap.options:-Xms6g
# /etc/elasticsearch/jvm.options.d/heap.options:-Xmx6g

# Check process command line (shows actual memory parameters)
ps aux | grep elasticsearch | grep java
# Revealed: -XX:MaxDirectMemorySize=8053063680 (≈7.5G)

Memory calculation:

  • Heap: 6G
  • Direct memory: ~7.5G
  • Overhead (threads, native memory, etc.): ~1-2G
  • Total: ~14-15G
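The arithmetic behind this estimate can be reproduced from the observed JVM parameters (the 2G overhead figure is a rough assumption, not a measured value):

```shell
# Worst-case Elasticsearch footprint from the parameters found above.
heap_bytes=$((6 * 1024 * 1024 * 1024))       # -Xmx6g
direct_bytes=8053063680                      # -XX:MaxDirectMemorySize as observed (~7.5G)
overhead_bytes=$((2 * 1024 * 1024 * 1024))   # threads + native memory, rough upper bound
total=$((heap_bytes + direct_bytes + overhead_bytes))
echo "$((total / 1024 / 1024 / 1024)) GiB"   # prints: 15 GiB
```

This total is comfortably above the initial MemoryMax=8G, which is why the OOM killer fired despite the 6G heap.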

JVM configuration files (/etc/elasticsearch/jvm.options.d/):

Elasticsearch supports a modular JVM configuration approach: settings can be organized in separate files under the jvm.options.d/ directory instead of editing the main jvm.options file:

# Heap configuration
cat /etc/elasticsearch/jvm.options.d/heap.options
-Xms6g
-Xmx6g

# Optional: Reduce direct memory if needed
# (Default is calculated automatically, can be ~50% of heap)
echo "-XX:MaxDirectMemorySize=3g" | sudo tee /etc/elasticsearch/jvm.options.d/memory.options

Advantages of the jvm.options.d/ approach:

  • Package updates don't overwrite custom settings
  • Clear separation of concerns (heap, memory, GC settings, etc.)
  • Easier to manage and version control

Solution: Raised systemd limit to 16G to accommodate actual memory needs.

Alternative approach (if memory is constrained):

  • Reduce heap to 4G: -Xms4g -Xmx4g
  • Limit direct memory: -XX:MaxDirectMemorySize=2g
  • Total would be ~7-8G (fits in original 8G limit)
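If the constrained variant were chosen, the two jvm.options.d/ files could be written as below. This sketch targets a scratch directory so it is safe to run as-is; a real deployment would write to /etc/elasticsearch/jvm.options.d/ as root:

```shell
# Scratch directory stands in for /etc/elasticsearch/jvm.options.d/ in this sketch.
CONF_DIR=$(mktemp -d)

printf '%s\n' '-Xms4g' '-Xmx4g' > "$CONF_DIR/heap.options"
printf '%s\n' '-XX:MaxDirectMemorySize=2g' > "$CONF_DIR/memory.options"

cat "$CONF_DIR"/*.options
```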

2) Verified effect

  • systemctl show confirmed new memory limits are loaded.
  • Post-change process view showed major reduction in MongoDB memory usage.
  • Overall host usage dropped to around 12G / 62G, leaving safe headroom.

3) MongoDB memory limit doubled (2026-02-27)

The MongoDB memory allowance was insufficient, so the systemd cgroup limits were doubled. Since /etc/systemd/system/mongod.service.d/override.conf does not exist, the unit file was edited directly:

sudo sed -i 's/MemoryMax=18G/MemoryMax=36G/; s/MemoryHigh=16G/MemoryHigh=32G/' /usr/lib/systemd/system/mongod.service
sudo systemctl daemon-reload
sudo systemctl restart mongod

Verified effective values after change:

  • mongod: MemoryHigh=34359738368 (32G), MemoryMax=38654705664 (36G)

Note: wiredTiger.cacheSizeGB remains at 8G (cache-only limit, separate from cgroup).

Final Conclusion

  • Primary fault pattern: resource contention and host OOM, not pure connectivity failure.
  • Most likely pressure source during incident window: high memory usage from major data services, with MongoDB as a significant contributor.
  • Current mitigations substantially reduce recurrence risk.

One-line Summary

  • Root cause: a host-level global OOM left MongoDB/SSH briefly unreachable even though ping still worked; the fix was to configure and verify systemd MemoryHigh/MemoryMax hard memory limits for mongod and elasticsearch so that no single service can exhaust host memory again.
  • Daily/regular checks:
    • free -h
    • dmesg -T | egrep -i "oom|out of memory|killed process"
  • If OOM reappears, expand to:
    • workload-level memory profiling for MongoDB, Elasticsearch, and Yugabyte
    • stricter per-service limits or host/service separation.
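The daily dmesg check can be wrapped in a small filter so OOM events are counted rather than eyeballed; a sketch (the pattern mirrors the grep used elsewhere in this document):

```shell
# Count OOM-related lines on stdin; prints 0 on clean logs.
# Usage: dmesg -T | count_oom
count_oom() {
  grep -Eic "oom|out of memory|killed process" || true
}

hits=$(dmesg -T 2>/dev/null | count_oom)
echo "OOM-related kernel log lines since boot: ${hits:-0}"
```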

Commands Used During Investigation (Reference)

Initial OOM investigation

systemctl list-units --type=service | egrep -i "ssh|mongo|containerd|k3s|kube"
sudo grep -Ei "oom|killed process|out of memory|sshd|mongo|mongod" /var/log/messages /var/log/secure
dmesg -T | egrep -i "oom|killed process|out of memory|I/O error|blocked for more than"
mongosh --quiet --eval 'db.serverCmdLineOpts().parsed.storage.wiredTiger.engineConfig.cacheSizeGB'
mongosh --quiet --eval 's=db.serverStatus().wiredTiger.cache; printjson({max:s["maximum bytes configured"], now:s["bytes currently in the cache"]})'
cat /proc/<mongod_pid>/cmdline | tr '\0' ' '
systemctl show mongod -p MemoryHigh -p MemoryMax
systemctl show elasticsearch -p MemoryHigh -p MemoryMax
ps -eo pid,comm,rss,%mem --sort=-rss | head -n 15

Elasticsearch restart hang troubleshooting

# Check service status when stuck
systemctl status elasticsearch --no-pager

# Check processes
ps aux | grep elasticsearch
pgrep -a java | grep elasticsearch

# Monitor real-time logs (most critical)
sudo journalctl -u elasticsearch -f --since "5 minutes ago"

# Check Elasticsearch logs
sudo tail -f /var/log/elasticsearch/elasticsearch.log

# Check JVM configuration
grep "^-Xms\|^-Xmx" /etc/elasticsearch/jvm.options /etc/elasticsearch/jvm.options.d/* 2>/dev/null

# Check systemd memory limits
systemctl show elasticsearch -p TimeoutStartSec -p MemoryHigh -p MemoryMax

# Check memory availability
free -h

# Check for recent OOM events
dmesg -T | egrep -i "oom.*elasticsearch" | tail -n 20

# Safe restart procedure
sudo systemctl stop elasticsearch
sudo pkill -9 -f elasticsearch  # if stop hangs
sudo systemctl start elasticsearch
sudo journalctl -u elasticsearch -f