Crash Handling and Resilience
Game servers crash. Hardware fails. Memory runs out. An edge case in your physics engine hits a null pointer. How your server handles these situations determines whether players lose minutes of progress or hours.
Graceful Shutdown with Signals
Before talking about crashes, make sure your server handles normal shutdowns correctly. When a hosting platform or admin wants to stop the server, it sends a signal.
SIGTERM: “Please shut down.” Your server should:
- Stop accepting new connections
- Notify connected players
- Save all world and player data
- Flush pending log output
- Exit with code 0
SIGINT (Ctrl+C): Should behave identically to SIGTERM.
Your server needs a signal handler that intercepts these signals and runs the shutdown sequence. Without one, the operating system terminates the process immediately, and all unsaved data is lost.
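As a sketch in Python, the handler registration and shutdown sequence might look like this; the helper functions are placeholders for your server's actual logic:

```python
import signal
import sys

def stop_accepting_connections():
    print("listener closed")              # placeholder

def notify_players(message):
    print(f"broadcast: {message}")        # placeholder

def save_all_data():
    print("world and player data saved")  # placeholder

def handle_shutdown(signum, frame):
    stop_accepting_connections()
    notify_players("Server is shutting down")
    save_all_data()
    sys.stdout.flush()  # flush pending log output
    sys.exit(0)         # exit code 0 = clean shutdown

# SIGTERM and SIGINT (Ctrl+C) run the same shutdown sequence.
signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)
```

Without these two `signal.signal` calls, Python (like most runtimes) falls back to the default disposition, which terminates the process without running any of your cleanup.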
Shutdown timeout: Hosting platforms give your server a window (typically 10-30 seconds) to shut down gracefully. If the server hasn’t exited by then, the platform sends SIGKILL, which cannot be intercepted. Design your shutdown sequence to complete within 10 seconds. If your save data is so large that saving takes longer, you need a faster save mechanism or more frequent auto-saves.
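One way to respect that window, sketched in Python, is to run the save on a worker thread and give up once a self-imposed deadline passes (the function names and deadline value here are illustrative):

```python
import threading

SHUTDOWN_DEADLINE = 10.0  # seconds; well under a typical 10-30 s platform window

def save_all_data():
    pass  # placeholder for the real save routine

def shutdown_with_deadline(deadline=SHUTDOWN_DEADLINE, save_fn=save_all_data):
    saver = threading.Thread(target=save_fn, daemon=True)
    saver.start()
    saver.join(timeout=deadline)  # wait, but never past the deadline
    finished = not saver.is_alive()
    if not finished:
        # SIGKILL is coming; record the failure while output still works.
        print("WARNING: save did not finish before shutdown deadline", flush=True)
    return finished
```

A save that regularly misses this deadline is the signal mentioned above that you need a faster save mechanism.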
Exit Codes
Exit codes tell the hosting platform what happened when your server stopped.
| Code | Meaning | Platform Response |
|---|---|---|
| 0 | Clean shutdown | Normal. No action needed. |
| 1 | General error / crash | Log the error. May auto-restart. |
| 137 | Killed by SIGKILL (128+9) | Server failed to shut down gracefully, or was killed by the OS OOM killer. Investigate. |
| 139 | Segmentation fault (128+11) | Crash. Auto-restart and flag for investigation. |
Use exit code 0 exclusively for intentional, clean shutdowns. Any non-zero exit code signals a problem. If you can distinguish between different failure modes, use different codes (e.g., 2 for config error, 3 for failed world load).
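A minimal Python sketch of such a convention; the codes beyond 0 and 1 are arbitrary choices, and `run_server` stands in for your real entry point:

```python
import sys

EXIT_OK = 0             # intentional, clean shutdown
EXIT_GENERAL_ERROR = 1  # unclassified crash
EXIT_BAD_CONFIG = 2     # configuration missing or unparsable
EXIT_WORLD_LOAD = 3     # world data failed to load

def run_server(config_path):
    try:
        with open(config_path) as f:
            config = f.read()
    except OSError as exc:
        print(f"config error: {exc}", file=sys.stderr)
        return EXIT_BAD_CONFIG
    # ... start the server with `config` ...
    return EXIT_OK

if __name__ == "__main__":
    sys.exit(run_server("server.cfg"))
```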
Crash Dumps and Diagnostics
When your server crashes, you need data to understand why.
Minidumps / core dumps: Configure your server (or the OS) to write crash dumps on fatal errors. This gives your engineering team a stack trace and memory snapshot.
- On Linux, enable core dumps: `ulimit -c unlimited`
- On Windows, configure the MiniDumpWriteDump API or use Crashpad/Breakpad
- Unreal Engine generates crash reports automatically in Shipping builds
- Unity crash handling is configured via `CrashReportHandler` in Player Settings, or by subscribing to `Application.logMessageReceived` for custom crash logging
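For servers or tooling written in Python, the standard-library `faulthandler` module offers a lightweight analog of a minidump: it dumps a traceback when the process receives a fatal signal such as SIGSEGV. The log path below is an assumption:

```python
import faulthandler

# Dump a traceback for every thread on SIGSEGV, SIGFPE, SIGABRT, or SIGBUS.
crash_log = open("fatal_traceback.log", "w")  # hypothetical path
faulthandler.enable(file=crash_log, all_threads=True)
```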
Crash log: Before the process terminates (if possible), write the last known state to a crash log file:
- What map was loaded
- How many players were connected
- What the server was doing when it crashed
- The stack trace (if available)
Write the crash log to a predictable location (e.g., ./logs/crash.log or ./logs/crash_20260115_143201.log).
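In Python, one way to capture this for uncaught exceptions is a custom `sys.excepthook` that writes the state before the default handler runs; the state fields and file naming below are illustrative:

```python
import sys
import traceback
from datetime import datetime, timezone

# Hypothetical snapshot; a real server would read these from live objects.
server_state = {"map": "harbor_town", "players": 14, "activity": "tick 48211"}

def write_crash_log(exc_type, exc, tb):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = f"crash_{stamp}.log"  # predictable, timestamped location
    with open(path, "w") as f:
        f.write(f"map loaded: {server_state['map']}\n")
        f.write(f"players connected: {server_state['players']}\n")
        f.write(f"activity: {server_state['activity']}\n")
        f.write("stack trace:\n")
        traceback.print_exception(exc_type, exc, tb, file=f)
    sys.__excepthook__(exc_type, exc, tb)  # still report to stderr
    return path

sys.excepthook = write_crash_log  # fires on any uncaught exception
```

This only covers crashes the runtime can observe; a hard native crash still needs the dump mechanisms above.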
Auto-Save and Corruption Prevention
A crash between saves loses all progress since the last save. An auto-save that crashes mid-write loses the save file itself.
Auto-save frequency: Save every 5-15 minutes. The interval is a trade-off between data loss on crash and disk I/O load.
Atomic writes: Never overwrite the save file directly.
- Write to a temp file (`world.sav.tmp`)
- On success, rename the temp file to replace the save (`world.sav`)
- Rename is atomic on most filesystems; a crash during the rename leaves either the old or the new file intact
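A Python sketch of the write-then-rename pattern; `os.replace` is the atomic rename on both POSIX and Windows when source and destination are on the same filesystem:

```python
import os

def atomic_save(path, data: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure bytes hit the disk before the rename
    os.replace(tmp, path)     # atomic: readers see the old file or the new one

atomic_save("world.sav", b"world data")
```

The `fsync` matters: without it, a power loss just after the rename can leave a new file name pointing at unwritten data.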
Backup rotation: Keep multiple save generations:
- `world.sav`: current save
- `world.sav.1`: previous save
- `world.sav.2`: two saves ago
When saving, rotate: move .1 to .2, move current to .1, then write the new save. If the current save is corrupted, admins can roll back.
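The rotation plus an atomic final write can be sketched like this in Python, with two backup generations as above:

```python
import os

def rotate_and_save(path, data: bytes, generations=2):
    # Shift older generations first: .1 -> .2, then current -> .1
    for i in range(generations, 1, -1):
        newer = f"{path}.{i - 1}"
        if os.path.exists(newer):
            os.replace(newer, f"{path}.{i}")
    if os.path.exists(path):
        os.replace(path, f"{path}.1")
    tmp = path + ".tmp"  # write the new save via the atomic temp-file pattern
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
```

Because each step is a rename, a crash at any point leaves every generation either fully present or fully absent, never half-written.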
Dirty Shutdown Recovery
When your server starts, it should detect whether the previous run exited cleanly.
A simple approach:
- At startup, create a file (e.g., `server.lock` or `running.flag`)
- At clean shutdown, delete the file
- At next startup, check if the file exists. If it does, the previous run did not shut down cleanly.
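The lock-file check fits in a few lines of Python; the file name is an arbitrary choice:

```python
import os

LOCK_FILE = "server.lock"  # hypothetical name; any fixed path works

def detect_dirty_shutdown():
    # If the marker survived, the previous run never finished its shutdown.
    dirty = os.path.exists(LOCK_FILE)
    open(LOCK_FILE, "w").close()  # (re)create the marker for this run
    return dirty

def mark_clean_shutdown():
    os.remove(LOCK_FILE)  # only reached when shutdown completes normally
```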
When a dirty shutdown is detected:
- Log a warning: “Previous server session did not shut down cleanly.”
- Validate the save file before loading (check for truncation, version header, integrity)
- If the save is corrupted, attempt to load a backup
- Report what happened so the admin knows the state
Memory Management
Game servers that run for days or weeks are susceptible to memory leaks. A slow leak that consumes an extra 100MB per hour may crash the server in a few days as it runs out of RAM.
- Set memory limits. If the hosting platform sets a memory limit (via cgroup or container config), your server should respect it. Running past the limit results in the OS killing the process (OOM killer), which is an ungraceful termination.
- Monitor memory usage internally. Log memory usage periodically in development. If you notice growth over time, investigate.
- Test with long-running sessions. Don’t just test for 30 minutes. Run your server for 24-48 hours under load and check memory usage.
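On Unix, Python's standard-library `resource` module exposes the process's peak resident set size, which is enough for periodic leak checks in development (note the units differ by OS: kilobytes on Linux, bytes on macOS):

```python
import resource
import sys

def log_memory_usage():
    # Peak resident set size of this process since it started.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"peak RSS: {peak}", file=sys.stderr)
    return peak
```

Call this on a timer (every few minutes) and watch for steady growth across a 24-48 hour run.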
When the OS OOM-kills your server, the exit code is typically 137 (SIGKILL). There is no chance to save. The only defense is preventing the situation through memory management and auto-saves.
What Hosting Platforms Do Automatically
Most hosting platforms (including Nodecraft) provide resilience features on top of your server:
- Health monitoring: The platform watches for hung or unresponsive servers
- Backup scheduling: Regular snapshots of the server’s data directory
These features work better when your server cooperates: clean exit codes, predictable save locations, and fast startup times.