MongoDB and SSH Unreachable While Ping Works - OOM Postmortem

Status: ✅ Completed

Created: 2026-02-10
Last updated: 2026-02-27 (doubled MongoDB memory limits)


Incident Summary

  • Symptom: for a period of time, MongoDB and SSH were unreachable, but ping to the server still worked.
  • Environment: single host running multiple heavy services (mongod, elasticsearch, yugabyte, k3s, kibana).
  • Result: service access eventually recovered; post-incident forensics were performed afterward.

Key Findings

1) Root cause category

  • Root cause was host-level memory exhaustion (global OOM), not a network link failure.
  • ping can still succeed during OOM because ICMP echo is answered by the kernel: it is lightweight and proves nothing about the health of user-space TCP services such as sshd or mongod.
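A quick way to verify this distinction during an incident is to probe the TCP ports directly instead of trusting ping; a minimal sketch (host and ports are placeholders for the affected server):

```shell
# Probe a TCP port with a timeout. ICMP reachability implies nothing here:
# this check fails if the daemon cannot accept connections, even while ping works.
tcp_check() {
  local host=$1 port=$2
  if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

tcp_check 127.0.0.1 22      # sshd
tcp_check 127.0.0.1 27017   # mongod
```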

2) Hard evidence from logs

  • Kernel logs showed repeated OOM events:
    • global_oom
    • Out of memory: Killed process ...
    • Victims included coredns and postgres.
  • OOM snapshots showed very high memory usage by mongod.
  • SSH logs included Broken pipe [preauth] around the same period, matching instability under memory pressure.

3) Why mongod looked larger than expected

  • wiredTiger.engineConfig.cacheSizeGB=8 was confirmed active.
  • serverStatus().wiredTiger.cache["maximum bytes configured"] confirmed 8 GiB.
  • However, mongod RSS can exceed WT cache due to:
    • non-cache allocations (connections, execution memory, index operations)
    • allocator behavior (tcmalloc retained memory, fragmentation)
    • additional internal memory overhead.
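The cache-versus-RSS gap can be measured directly from /proc; a sketch (assumes Linux and a running mongod; compare the printed RSS against the 8 GiB cache limit):

```shell
# Print a process's resident set size (RSS) in MiB, read from /proc.
# RSS above wiredTiger's 8 GiB cache limit is the non-cache overhead described above.
rss_mib() {
  awk '/^VmRSS:/ { printf "%d\n", $2 / 1024 }' "/proc/$1/status"
}

# Guarded lookup: only report if a mongod process actually exists.
if pid=$(pgrep -x mongod | head -n 1) && [ -n "$pid" ]; then
  echo "mongod RSS: $(rss_mib "$pid") MiB"
fi
```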

4) Clarification about “many mongod processes”

  • The long list in htop was thread view, not many independent MongoDB instances.
  • Linux can display one process with many thread IDs when thread display is enabled.
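This is easy to confirm numerically; a sketch comparing process count to thread (LWP) count for a named service (the process name is a parameter, mongod assumed here):

```shell
# Print "<process count> <thread count>" for a given process name.
# htop's thread view lists every LWP, so the second number is what it shows.
thread_vs_process() {
  procs=$(ps -C "$1" -o pid= | wc -l)
  threads=$(ps -C "$1" -L -o lwp= | wc -l)
  echo "$procs $threads"
}

thread_vs_process mongod   # e.g. "1 150": one process, many threads
```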

Actions Taken

1) Enforced memory guardrails

  • Added cgroup memory controls for MongoDB service:
    • MemoryHigh=16G
    • MemoryMax=18G
  • Added cgroup memory controls for Elasticsearch service:
    • MemoryHigh=7G
    • MemoryMax=8G

1.1) Exact implementation (critical fix)

mongod unit file (/usr/lib/systemd/system/mongod.service) includes:

[Service]
MemoryMax=18G
MemoryHigh=16G

elasticsearch service configuration (initial attempt):

[Service]
MemoryMax=8G
MemoryHigh=7G

Important discovery: Initial 8G limit was insufficient for Elasticsearch with 6G heap + ~7.5G direct memory requirement.

Final working configuration (using systemd override):

sudo systemctl edit elasticsearch

Add to the override file:

[Service]
MemoryHigh=14G
MemoryMax=16G

Apply and verify steps:

sudo systemctl daemon-reload
sudo systemctl restart mongod elasticsearch
systemctl show mongod -p MemoryHigh -p MemoryMax
systemctl show elasticsearch -p MemoryHigh -p MemoryMax

Observed effective values:

  • mongod: MemoryHigh=17179869184 (16G), MemoryMax=19327352832 (18G)
  • elasticsearch: MemoryHigh=15032385536 (14G), MemoryMax=17179869184 (16G)

Notes:

  • MemoryHigh is a soft threshold (the kernel throttles and reclaims aggressively above it); MemoryMax is the hard cap. Both are enforced at the service cgroup level.
  • This is separate from MongoDB wiredTiger.cacheSizeGB=8 (cache-only limit).
  • MemoryMax prevents one service from exhausting host memory and triggering global OOM again.
  • Elasticsearch requires a higher limit due to JVM heap + direct memory + overhead.

1.2) Elasticsearch JVM configuration details

Problem encountered during restart:

When attempting to restart Elasticsearch with the initial 8G systemd limit, the service was killed by OOM killer despite heap being configured at 6G:

# Evidence from logs
journalctl -u elasticsearch
# Output showed:
# elasticsearch.service: A process of this unit has been killed by the OOM killer.

Root cause: Elasticsearch memory usage = JVM heap + direct memory + overhead

# Check heap configuration
grep "^-Xms\|^-Xmx" /etc/elasticsearch/jvm.options /etc/elasticsearch/jvm.options.d/* 2>/dev/null
# Output:
# /etc/elasticsearch/jvm.options.d/heap.options:-Xms6g
# /etc/elasticsearch/jvm.options.d/heap.options:-Xmx6g

# Check process command line (shows actual memory parameters)
ps aux | grep elasticsearch | grep java
# Revealed: -XX:MaxDirectMemorySize=8053063680 (≈7.5G)

Memory calculation:

  • Heap: 6G
  • Direct memory: ~7.5G
  • Overhead (threads, native memory, etc.): ~1-2G
  • Total: ~14-15G
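The arithmetic behind this estimate can be reproduced from the observed JVM parameters (the 2G overhead figure is a rough assumption, not a measured value):

```shell
# Worst-case Elasticsearch footprint from the parameters found above.
heap_bytes=$((6 * 1024 * 1024 * 1024))       # -Xmx6g
direct_bytes=8053063680                      # -XX:MaxDirectMemorySize as observed (~7.5G)
overhead_bytes=$((2 * 1024 * 1024 * 1024))   # threads + native memory, rough upper bound
total=$((heap_bytes + direct_bytes + overhead_bytes))
echo "$((total / 1024 / 1024 / 1024)) GiB"   # prints: 15 GiB
```

This total is comfortably above the initial MemoryMax=8G, which is why the OOM killer fired despite the 6G heap.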

JVM configuration files (/etc/elasticsearch/jvm.options.d/):

Elasticsearch supports a modular JVM configuration approach: settings can be organized in separate files under the jvm.options.d/ directory instead of editing the main jvm.options file:

# Heap configuration
cat /etc/elasticsearch/jvm.options.d/heap.options
-Xms6g
-Xmx6g

# Optional: Reduce direct memory if needed
# (Default is calculated automatically, can be ~50% of heap)
echo "-XX:MaxDirectMemorySize=3g" | sudo tee /etc/elasticsearch/jvm.options.d/memory.options

Advantages of the jvm.options.d/ approach:

  • Package updates don't overwrite custom settings
  • Clear separation of concerns (heap, memory, GC settings, etc.)
  • Easier to manage and version control

Solution: Raised systemd limit to 16G to accommodate actual memory needs.

Alternative approach (if memory is constrained):

  • Reduce heap to 4G: -Xms4g -Xmx4g
  • Limit direct memory: -XX:MaxDirectMemorySize=2g
  • Total would be ~7-8G (fits in original 8G limit)
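If the constrained variant were chosen, the two jvm.options.d/ files could be written as below. This sketch targets a scratch directory so it is safe to run as-is; a real deployment would write to /etc/elasticsearch/jvm.options.d/ as root:

```shell
# Scratch directory stands in for /etc/elasticsearch/jvm.options.d/ in this sketch.
CONF_DIR=$(mktemp -d)

printf '%s\n' '-Xms4g' '-Xmx4g' > "$CONF_DIR/heap.options"
printf '%s\n' '-XX:MaxDirectMemorySize=2g' > "$CONF_DIR/memory.options"

cat "$CONF_DIR"/*.options
```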

2) Verified effect

  • systemctl show confirmed new memory limits are loaded.
  • Post-change process view showed major reduction in MongoDB memory usage.
  • Overall host usage dropped to around 12G / 62G, leaving safe headroom.

3) MongoDB memory limit doubled (2026-02-27)

The MongoDB memory allowance was insufficient, so the systemd cgroup limits were doubled. Since /etc/systemd/system/mongod.service.d/override.conf does not exist, the unit file was edited directly:

sudo sed -i 's/MemoryMax=18G/MemoryMax=36G/; s/MemoryHigh=16G/MemoryHigh=32G/' /usr/lib/systemd/system/mongod.service
sudo systemctl daemon-reload
sudo systemctl restart mongod

Verified effective values after change:

  • mongod: MemoryHigh=34359738368 (32G), MemoryMax=38654705664 (36G)

Note: wiredTiger.cacheSizeGB remains at 8G (cache-only limit, separate from cgroup).

Final Conclusion

  • Primary fault pattern: resource contention and host OOM, not pure connectivity failure.
  • Most likely pressure source during incident window: high memory usage from major data services, with MongoDB as a significant contributor.
  • Current mitigations substantially reduce recurrence risk.

One-line Summary

  • Root cause: a host-level global OOM left MongoDB/SSH briefly unreachable even though ping still worked; the fix was to configure and verify systemd MemoryHigh/MemoryMax hard memory limits for mongod and elasticsearch so that no single service can exhaust host memory again.
  • Daily/regular checks:
    • free -h
    • dmesg -T | egrep -i "oom|out of memory|killed process"
  • If OOM reappears, expand to:
    • workload-level memory profiling for MongoDB, Elasticsearch, and Yugabyte
    • stricter per-service limits or host/service separation.
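The daily dmesg check can be wrapped in a small filter so OOM events are counted rather than eyeballed; a sketch (the pattern mirrors the grep used elsewhere in this document):

```shell
# Count OOM-related lines on stdin; prints 0 on clean logs.
# Usage: dmesg -T | count_oom
count_oom() {
  grep -Eic "oom|out of memory|killed process" || true
}

hits=$(dmesg -T 2>/dev/null | count_oom)
echo "OOM-related kernel log lines since boot: ${hits:-0}"
```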

Commands Used During Investigation (Reference)

Initial OOM investigation

systemctl list-units --type=service | egrep -i "ssh|mongo|containerd|k3s|kube"
sudo grep -Ei "oom|killed process|out of memory|sshd|mongo|mongod" /var/log/messages /var/log/secure
dmesg -T | egrep -i "oom|killed process|out of memory|I/O error|blocked for more than"
mongosh --quiet --eval 'db.serverCmdLineOpts().parsed.storage.wiredTiger.engineConfig.cacheSizeGB'
mongosh --quiet --eval 's=db.serverStatus().wiredTiger.cache; printjson({max:s["maximum bytes configured"], now:s["bytes currently in the cache"]})'
cat /proc/<mongod_pid>/cmdline | tr '\0' ' '
systemctl show mongod -p MemoryHigh -p MemoryMax
systemctl show elasticsearch -p MemoryHigh -p MemoryMax
ps -eo pid,comm,rss,%mem --sort=-rss | head -n 15

Elasticsearch restart hang troubleshooting

# Check service status when stuck
systemctl status elasticsearch --no-pager

# Check processes
ps aux | grep elasticsearch
pgrep -a java | grep elasticsearch

# Monitor real-time logs (most critical)
sudo journalctl -u elasticsearch -f --since "5 minutes ago"

# Check Elasticsearch logs
sudo tail -f /var/log/elasticsearch/elasticsearch.log

# Check JVM configuration
grep "^-Xms\|^-Xmx" /etc/elasticsearch/jvm.options /etc/elasticsearch/jvm.options.d/* 2>/dev/null

# Check systemd memory limits
systemctl show elasticsearch -p TimeoutStartSec -p MemoryHigh -p MemoryMax

# Check memory availability
free -h

# Check for recent OOM events
dmesg -T | egrep -i "oom.*elasticsearch" | tail -n 20

# Safe restart procedure
sudo systemctl stop elasticsearch
sudo pkill -9 -f elasticsearch  # if stop hangs
sudo systemctl start elasticsearch
sudo journalctl -u elasticsearch -f