Crash Handling and Resilience
Game servers crash. Hardware fails. Memory runs out. An edge case in your physics engine hits a null pointer. How your server handles these situations determines whether players lose minutes of progress or hours.
Graceful Shutdown with Signals
Before talking about crashes, make sure your server handles normal shutdowns correctly. When a hosting platform or admin wants to stop the server, it sends a signal.
SIGTERM: “Please shut down.” Your server should:
- Stop accepting new connections
- Notify connected players
- Save all world and player data
- Flush pending log output
- Exit with code 0
SIGINT (Ctrl+C): Should behave identically to SIGTERM.
Your server needs a signal handler that intercepts these signals and runs the shutdown sequence. Without one, the operating system terminates the process immediately, and all unsaved data is lost.
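As a sketch in Python, the handler registration and shutdown sequence might look like this; the helper functions are placeholders for your server's actual logic:

```python
import signal
import sys

def stop_accepting_connections():
    print("listener closed")              # placeholder

def notify_players(message):
    print(f"broadcast: {message}")        # placeholder

def save_all_data():
    print("world and player data saved")  # placeholder

def handle_shutdown(signum, frame):
    stop_accepting_connections()
    notify_players("Server is shutting down")
    save_all_data()
    sys.stdout.flush()  # flush pending log output
    sys.exit(0)         # exit code 0 = clean shutdown

# SIGTERM and SIGINT (Ctrl+C) run the same shutdown sequence.
signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)
```

Without these two `signal.signal` calls, Python (like most runtimes) falls back to the default disposition, which terminates the process without running any of your cleanup.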
Shutdown timeout: Hosting platforms give your server a window (typically 10-30 seconds) to shut down gracefully. If the server hasn’t exited by then, the platform sends SIGKILL, which cannot be intercepted. Design your shutdown sequence to complete within 10 seconds. If your save data is so large that saving takes longer, you need a faster save mechanism or more frequent auto-saves.
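One way to respect that window, sketched in Python, is to run the save on a worker thread and give up once a self-imposed deadline passes (the function names and deadline value here are illustrative):

```python
import threading

SHUTDOWN_DEADLINE = 10.0  # seconds; well under a typical 10-30 s platform window

def save_all_data():
    pass  # placeholder for the real save routine

def shutdown_with_deadline(deadline=SHUTDOWN_DEADLINE, save_fn=save_all_data):
    saver = threading.Thread(target=save_fn, daemon=True)
    saver.start()
    saver.join(timeout=deadline)  # wait, but never past the deadline
    finished = not saver.is_alive()
    if not finished:
        # SIGKILL is coming; record the failure while output still works.
        print("WARNING: save did not finish before shutdown deadline", flush=True)
    return finished
```

A save that regularly misses this deadline is the signal mentioned above that you need a faster save mechanism.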
Exit Codes
Exit codes tell the hosting platform what happened when your server stopped.
| Code | Meaning | Platform Response |
|---|---|---|
| 0 | Clean shutdown | Normal. No action needed. |
| 1 | General error / crash | Log the error. May auto-restart. |
| 137 | Killed by SIGKILL (128+9) | Server failed to shut down gracefully, or was killed by the OS OOM killer. Investigate. |
| 139 | Segmentation fault (128+11) | Crash. Auto-restart and flag for investigation. |
Use exit code 0 exclusively for intentional, clean shutdowns. Any non-zero exit code signals a problem. If you can distinguish between different failure modes, use different codes (e.g., 2 for config error, 3 for failed world load).
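A minimal Python sketch of such a convention; the codes beyond 0 and 1 are arbitrary choices, and `run_server` stands in for your real entry point:

```python
import sys

EXIT_OK = 0             # intentional, clean shutdown
EXIT_GENERAL_ERROR = 1  # unclassified crash
EXIT_BAD_CONFIG = 2     # configuration missing or unparsable
EXIT_WORLD_LOAD = 3     # world data failed to load

def run_server(config_path):
    try:
        with open(config_path) as f:
            config = f.read()
    except OSError as exc:
        print(f"config error: {exc}", file=sys.stderr)
        return EXIT_BAD_CONFIG
    # ... start the server with `config` ...
    return EXIT_OK

if __name__ == "__main__":
    sys.exit(run_server("server.cfg"))
```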
Crash Dumps and Diagnostics
When your server crashes, you need data to understand why.
Minidumps / core dumps: Configure your server (or the OS) to write crash dumps on fatal errors. This gives your engineering team a stack trace and memory snapshot.
- On Linux, enable core dumps: `ulimit -c unlimited`
- On Windows, configure the MiniDumpWriteDump API or use Crashpad/Breakpad
- Unreal Engine generates crash reports automatically in Shipping builds
- Unity crash handling is configured via `CrashReportHandler` in Player Settings, or by subscribing to `Application.logMessageReceived` for custom crash logging
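For servers or tooling written in Python, the standard-library `faulthandler` module offers a lightweight analog of a minidump: it dumps a traceback when the process receives a fatal signal such as SIGSEGV. The log path below is an assumption:

```python
import faulthandler

# Dump a traceback for every thread on SIGSEGV, SIGFPE, SIGABRT, or SIGBUS.
crash_log = open("fatal_traceback.log", "w")  # hypothetical path
faulthandler.enable(file=crash_log, all_threads=True)
```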
Crash log: Before the process terminates (if possible), write the last known state to a crash log file:
- What map was loaded
- How many players were connected
- What the server was doing when it crashed
- The stack trace (if available)
Write the crash log to a predictable location (e.g., ./logs/crash.log or ./logs/crash_20260115_143201.log).
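In Python, one way to capture this for uncaught exceptions is a custom `sys.excepthook` that writes the state before the default handler runs; the state fields and file naming below are illustrative:

```python
import sys
import traceback
from datetime import datetime, timezone

# Hypothetical snapshot; a real server would read these from live objects.
server_state = {"map": "harbor_town", "players": 14, "activity": "tick 48211"}

def write_crash_log(exc_type, exc, tb):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = f"crash_{stamp}.log"  # predictable, timestamped location
    with open(path, "w") as f:
        f.write(f"map loaded: {server_state['map']}\n")
        f.write(f"players connected: {server_state['players']}\n")
        f.write(f"activity: {server_state['activity']}\n")
        f.write("stack trace:\n")
        traceback.print_exception(exc_type, exc, tb, file=f)
    sys.__excepthook__(exc_type, exc, tb)  # still report to stderr
    return path

sys.excepthook = write_crash_log  # fires on any uncaught exception
```

This only covers crashes the runtime can observe; a hard native crash still needs the dump mechanisms above.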
Auto-Save and Corruption Prevention
A crash between saves loses all progress since the last save. An auto-save that crashes mid-write loses the save file itself.
Auto-save frequency: Save every 5-15 minutes. The interval is a trade-off between data loss on crash and disk I/O load.
Atomic writes: Never overwrite the save file directly.
- Write to a temp file (`world.sav.tmp`)
- On success, rename the temp file to replace the save (`world.sav`)
- Rename is atomic on most filesystems; a crash during the rename leaves either the old or the new file intact
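A Python sketch of the write-then-rename pattern; `os.replace` is the atomic rename on both POSIX and Windows when source and destination are on the same filesystem:

```python
import os

def atomic_save(path, data: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure bytes hit the disk before the rename
    os.replace(tmp, path)     # atomic: readers see the old file or the new one

atomic_save("world.sav", b"world data")
```

The `fsync` matters: without it, a power loss just after the rename can leave a new file name pointing at unwritten data.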
Backup rotation: Keep multiple save generations:
- `world.sav`: current save
- `world.sav.1`: previous save
- `world.sav.2`: two saves ago
When saving, rotate: move .1 to .2, move current to .1, then write the new save. If the current save is corrupted, admins can roll back.
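The rotation plus an atomic final write can be sketched like this in Python, with two backup generations as above:

```python
import os

def rotate_and_save(path, data: bytes, generations=2):
    # Shift older generations first: .1 -> .2, then current -> .1
    for i in range(generations, 1, -1):
        newer = f"{path}.{i - 1}"
        if os.path.exists(newer):
            os.replace(newer, f"{path}.{i}")
    if os.path.exists(path):
        os.replace(path, f"{path}.1")
    tmp = path + ".tmp"  # write the new save via the atomic temp-file pattern
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
```

Because each step is a rename, a crash at any point leaves every generation either fully present or fully absent, never half-written.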
Dirty Shutdown Recovery
When your server starts, it should detect whether the previous run exited cleanly.
A simple approach:
- At startup, create a file (e.g., `server.lock` or `running.flag`)
- At clean shutdown, delete the file
- At next startup, check if the file exists. If it does, the previous run did not shut down cleanly.
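The lock-file check fits in a few lines of Python; the file name is an arbitrary choice:

```python
import os

LOCK_FILE = "server.lock"  # hypothetical name; any fixed path works

def detect_dirty_shutdown():
    # If the marker survived, the previous run never finished its shutdown.
    dirty = os.path.exists(LOCK_FILE)
    open(LOCK_FILE, "w").close()  # (re)create the marker for this run
    return dirty

def mark_clean_shutdown():
    os.remove(LOCK_FILE)  # only reached when shutdown completes normally
```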
When a dirty shutdown is detected:
- Log a warning: “Previous server session did not shut down cleanly.”
- Validate the save file before loading (check for truncation, version header, integrity)
- If the save is corrupted, attempt to load a backup
- Report what happened so the admin knows the state
Memory Management
Game servers that run for days or weeks are susceptible to memory leaks. A slow leak that consumes an extra 100MB per hour may crash the server in a few days as it runs out of RAM.
- Set memory limits. If the hosting platform sets a memory limit (via cgroup or container config), your server should respect it. Running past the limit results in the OS killing the process (OOM killer), which is an ungraceful termination.
- Monitor memory usage internally. Log memory usage periodically in development. If you notice growth over time, investigate.
- Test with long-running sessions. Don’t just test for 30 minutes. Run your server for 24-48 hours under load and check memory usage.
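On Unix, Python's standard-library `resource` module exposes the process's peak resident set size, which is enough for periodic leak checks in development (note the units differ by OS: kilobytes on Linux, bytes on macOS):

```python
import resource
import sys

def log_memory_usage():
    # Peak resident set size of this process since it started.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"peak RSS: {peak}", file=sys.stderr)
    return peak
```

Call this on a timer (every few minutes) and watch for steady growth across a 24-48 hour run.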
When the OS OOM-kills your server, the exit code is typically 137 (SIGKILL). There is no chance to save. The only defense is preventing the situation through memory management and auto-saves.
What Hosting Platforms Do Automatically
Most hosting platforms (including Nodecraft) provide resilience features on top of your server:
- Health monitoring: The platform watches for hung or unresponsive servers
- Backup scheduling: Regular snapshots of the server’s data directory
These features work better when your server cooperates: clean exit codes, predictable save locations, and fast startup times.