Linux Commands I Actually Use When a Box Is Misbehaving
The problem
This isn't a single incident. It's the set of commands I actually reach for when I SSH into a Linux box that's misbehaving and need to understand what's happening before I start changing things. I've grouped them by what question I'm trying to answer.
What's eating CPU
The first thing I want is a live view of where time is going:
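A minimal starting point, assuming the procps build of top found on most distributions. The plain command is interactive, so this sketch also includes a one-shot batch variant for when the output needs to be pasted somewhere; `cpu_snapshot` is a made-up helper name, not a standard command.

```shell
# Interactive live view (press q to quit):
#   top

# One-shot alternative: a single batch-mode iteration, trimmed to the summary
# header plus the busiest processes. cpu_snapshot is a hypothetical name.
cpu_snapshot() {
    top -b -n 1 | head -n 20
}
```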
But raw top output is noisy. I immediately press 1 to show per-CPU usage (which tells me whether it's a single-threaded bottleneck or system-wide) and M to sort by memory. For a cleaner view:
If htop isn't installed, it's usually one apt install htop or yum install htop away, and worth it. The tree view shows parent-child process relationships, which matters when something is fork-bombing.
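A small sketch of launching htop, assuming it is installed. Because htop is interactive, it is wrapped in a function here so nothing starts until you call it; `cpu_tree` is a hypothetical name.

```shell
# cpu_tree is a hypothetical wrapper, not a standard command.
cpu_tree() {
    if command -v htop >/dev/null 2>&1; then
        htop --tree    # start in tree view; F5 toggles it inside htop, q quits
    else
        echo "htop is not installed" >&2
        return 1
    fi
}
```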
To find what's burning CPU without an interactive tool:
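One way to get that listing, assuming the procps (Linux) version of ps, which provides the `--sort` option:

```shell
# Header line plus the 20 busiest processes, sorted by %CPU, highest first.
ps aux --sort=-%cpu | head -n 21
```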
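That gives me the top 20 processes by CPU. Keep in mind that ps reports %CPU as CPU time divided by how long the process has been running, so it's a lifetime average rather than an instantaneous reading. If I want to watch it over time: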
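A sketch of the repeating version, assuming `watch` from procps is available. `watch_cpu` is a hypothetical wrapper name, used only so nothing runs until it is called.

```shell
# Re-run the CPU listing every 2 seconds; Ctrl-C to stop.
# watch_cpu is a hypothetical name, not a standard command.
watch_cpu() {
    watch -n 2 'ps aux --sort=-%cpu | head -n 21'
}
```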
If a specific process is the culprit and I want to see which threads inside it are the problem:
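A sketch of both an interactive and a one-shot way to look at threads, assuming procps top and ps. The two function names are hypothetical; each takes the PID of the suspect process as its argument.

```shell
# Interactive per-thread view of one process (q to quit).
threads_of() {
    top -H -p "$1"
}

# Non-interactive alternative: one line per thread, busiest first.
thread_snapshot() {
    ps -L -o pid,tid,pcpu,comm -p "$1" --sort=-pcpu
}
```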
-H shows individual threads, -p scopes it to one process. Useful when a multi-threaded Java or .NET process is misbehaving and you need to know if it's one runaway thread or all of them.
What's eating disk
Check overall disk usage first:
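A minimal example of that first check. The inode line is an addition of mine rather than something the text above mentions: a filesystem can refuse writes with bytes to spare if it has run out of inodes, so it is cheap to look at both.

```shell
# Space used and available per mounted filesystem, in human-readable units
df -h

# Inode usage per filesystem (an extra check, not from the text above)
df -i
```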
If a filesystem is nearly full, find what's eating it:
Work top-down: start at the root, find the biggest directory, descend into it:
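The top-down walk can be sketched as a small helper, assuming GNU du and sort. `biggest_dirs` is a hypothetical name; it lists the largest entries one level below a directory, biggest first.

```shell
# biggest_dirs [DIR]: sizes one level below DIR (default /), largest first.
# The directory's own total appears in the list as well.
biggest_dirs() {
    # -x stays on one filesystem so other mounts do not skew the numbers
    du -xh --max-depth=1 "${1:-/}" 2>/dev/null | sort -rh | head -n 15
}
```

Start with `biggest_dirs /`, then call it again on whichever directory tops the list, for example `biggest_dirs /var`.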
The 2>/dev/null suppresses permission errors on directories you can't read. Once I've narrowed it down to a folder:
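Inside a single folder it helps to include individual files, not just directories. A sketch, again assuming GNU du and sort; `biggest_entries` is a hypothetical name and `/var/log` is only an example default.

```shell
# biggest_entries [DIR]: the 20 largest files and directories anywhere under
# DIR (default /var/log), largest first. -a makes du report files too.
biggest_entries() {
    du -ah "${1:-/var/log}" 2>/dev/null | sort -rh | head -n 20
}
```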
Find files over a specific size:
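A sketch using find, assuming GNU findutils. `big_files` is a hypothetical helper; the 100M default threshold is an example value, not a recommendation from the text.

```shell
# big_files [DIR] [SIZE]: regular files larger than SIZE (default 100M) under
# DIR (default /), without crossing into other filesystems (-xdev).
big_files() {
    find "${1:-/}" -xdev -type f -size "+${2:-100M}" -exec ls -lh {} + 2>/dev/null
}
```

For example, `big_files /var 500M` lists everything over 500 MiB under /var.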
Runaway log files are usually the culprit. If rotation is broken, a single log file can fill a disk in hours. Check when a file was last modified:
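Two ways to check modification times, as a sketch. The directory listing uses `/var/log` as an example path, and `last_touched` is a hypothetical helper that relies on GNU stat's `-c` format option.

```shell
# Newest-first listing of a log directory (example path)
ls -lht /var/log 2>/dev/null | head -n 10

# last_touched FILE...: modification time, size and name for each file
last_touched() {
    stat -c '%y  %s bytes  %n' "$@"
}
```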
For active log growth, watch a file's size over time:
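A sketch of two approaches, assuming procps watch and GNU stat. Both function names are hypothetical: `watch_size` gives a live view, and `log_growth` measures how many bytes a file gained over an interval, which turns "it's growing" into a number.

```shell
# watch_size FILE: re-run ls every 5 seconds and highlight changes (-d).
# Paths containing single quotes would need extra escaping.
watch_size() {
    watch -d -n 5 "ls -l '$1'"
}

# log_growth FILE [SECONDS]: bytes added to FILE during the interval (default 10 s)
log_growth() {
    _before=$(stat -c %s "$1")
    sleep "${2:-10}"
    _after=$(stat -c %s "$1")
    echo $((_after - _before))
}
```

Multiplying the result of `log_growth app.log 60` by 60 gives a rough bytes-per-hour figure to compare against the free space left.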
What's eating memory
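A minimal overview to start from, assuming procps free is installed; the second command reads the same figures straight from the kernel in case it is not.

```shell
# Memory summary in human-readable units
free -h

# The underlying kernel counters (values in kB)
grep -E '^(MemTotal|MemFree|MemAvailable):' /proc/meminfo
```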
In the output of free -h, the available column is what matters, not free: available also counts page cache that the kernel can reclaim when applications need the memory.
For a per-process breakdown:
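One way to get that breakdown, assuming procps ps. Sorting on resident set size (RSS) is my choice of key here; `--sort=-%mem` gives the same ordering as a percentage.

```shell
# Header plus the 20 largest processes by resident memory (RSS column, in KiB)
ps aux --sort=-rss | head -n 21
```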
If I suspect a memory leak — a process growing over time — I'll check its current RSS and watch it:
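# Replace 1234 with the PID of the process you're investigating
PID=1234
# Current resident memory for the process
grep VmRSS "/proc/$PID/status"
# Watch it over time (refreshes every 10 seconds)
watch -n 10 "grep VmRSS /proc/$PID/status"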
If the system is swapping heavily, performance will be terrible even if total memory looks okay:
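A sketch for checking swap activity, assuming procps vmstat. `swap_activity` is a hypothetical wrapper that takes a bounded number of samples so it exits on its own.

```shell
# swap_activity [COUNT]: one sample per second, COUNT samples (default 5).
# The first row of vmstat output is an average since boot; judge by the rest.
swap_activity() {
    vmstat 1 "${1:-5}"
}

# Cumulative pages swapped in and out since boot, straight from the kernel
grep -E '^pswp(in|out) ' /proc/vmstat 2>/dev/null
```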
High si (swap in) and so (swap out) values mean you're in swap hell. Either the process needs more memory, or there's a leak that needs addressing.
What the process is actually doing
When a process is hung, slow, or misbehaving and I can't tell why from the outside:
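A sketch of attaching strace to a running process. Both helper names are hypothetical, and the extra flags are one reasonable choice rather than a required set. Attaching usually needs root or the same user as the target, and may be blocked by ptrace restrictions on hardened hosts.

```shell
# trace_calls PID: stream system calls as they happen (Ctrl-C to detach).
#   -f  follow threads and child processes
#   -tt wall-clock timestamps with microseconds
#   -T  time spent inside each call
trace_calls() {
    strace -f -tt -T -p "$1"
}

# syscall_summary PID: count calls instead of printing them;
# the table is printed when you press Ctrl-C.
syscall_summary() {
    strace -c -f -p "$1"
}
```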
This shows every system call in real time. If it's stuck in a read() or select() call, it's waiting on I/O or a socket. If it's looping through stat() calls on files, it might be polling a directory endlessly. If it's doing nothing — no system calls at all — it's stuck in user-space computation or genuinely hung.
For a snapshot of what file descriptors the process has open:
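A minimal sketch that reads the descriptor table from /proc. Both function names are hypothetical, and both default to the current shell's PID so they are safe to try. The second helper filters for files that have been deleted while still held open.

```shell
# open_fds [PID]: every open descriptor and what it points at
open_fds() {
    ls -l "/proc/${1:-$$}/fd"
}

# deleted_but_open [PID]: descriptors whose file has been unlinked but is
# still held open, so its space has not been released yet
deleted_but_open() {
    ls -l "/proc/${1:-$$}/fd" 2>/dev/null | grep '(deleted)'
}
```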
Open sockets, files being read or written, pipes — all visible here. Useful for confirming a service is actually holding a log file open (which prevents the disk space from being freed when you delete the file without restarting the service).
To see what files a process has open in a more readable form:
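A sketch using lsof, assuming it is installed (it often is not on minimal images). `open_files` is a hypothetical name.

```shell
# open_files PID: files, sockets and pipes held by one process.
#   -n  skip DNS lookups, -P  skip port-name lookups (both make it faster)
open_files() {
    lsof -n -P -p "$1"
}
```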
Or to find which process is holding a specific file open:
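The reverse lookup, sketched with lsof under the same assumption that it is installed. Both function names are hypothetical.

```shell
# who_has FILE: processes that currently have FILE open
who_has() {
    lsof "$1"
}

# unlinked_open: open files with a link count below 1, i.e. deleted files
# that are still holding disk space somewhere on the system
unlinked_open() {
    lsof +L1
}
```

Run as root to see other users' processes; without it the listing is limited to your own.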
What's happening on the network
Check what's listening and what's connected:
ss -tlnp # listening TCP sockets, with process names
ss -tnp # established TCP connections, with process names
I prefer ss over netstat: it's faster and available by default on modern Linux. Without root, the -p flag only shows process names for sockets owned by your own user.
For a count of connections by state (useful for spotting connection pool exhaustion or TIME_WAIT buildup):
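A sketch of the per-state count, assuming iproute2 ss. The counting is split into a small stdin filter, `count_states` (a hypothetical name), so it can also be used on saved output.

```shell
# count_states: read ss output on stdin, skip the header line, and count
# how many sockets are in each state (first column), most common first.
count_states() {
    awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
}

# All TCP sockets, numeric addresses, every state
ss -tan | count_states
```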
If I'm debugging a specific port:
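A sketch using ss's built-in filter syntax, assuming iproute2. `port_sockets` is a hypothetical name, and 8080 is only an example default.

```shell
# port_sockets [PORT]: TCP sockets in any state whose local or remote port
# matches PORT (default 8080), with owning process where visible
port_sockets() {
    ss -tanp "( sport = :${1:-8080} or dport = :${1:-8080} )"
}
```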
To watch connections in real time:
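One simple way to do this, assuming procps watch and iproute2 ss. The `ss -s` summary is my choice for a compact display; `watch_conns` is a hypothetical name.

```shell
# watch_conns: refresh the socket summary every second; Ctrl-C to stop
watch_conns() {
    watch -n 1 'ss -s'
}
```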
Digging through logs fast
The standard toolkit for log analysis, in order of how often I use it:
# Real-time tail with filtering
tail -f app.log | grep --line-buffered "ERROR"
# Count the most frequent error lines (strip timestamps first, or every line counts as unique)
grep "ERROR" app.log | sort | uniq -c | sort -rn | head -n 20
# Show context around a match (3 lines before, 3 after)
grep -B 3 -A 3 "OutOfMemoryError" app.log
# Search recursively across all log files in a directory
grep -rn "connection refused" /var/log/myapp/
# Extract specific field from structured logs
grep -oP '"error":"[^"]*"' app.log | sort | uniq -c | sort -rn
# Find all log files modified in the last hour
find /var/log -name "*.log" -mmin -60
For large CLOBs or binary-adjacent content where grep gets confused, strings is occasionally useful:
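A sketch of that approach, assuming strings from binutils is installed. `strings_grep` is a hypothetical helper, and the 8-character minimum is an arbitrary example threshold.

```shell
# strings_grep PATTERN FILE: extract printable runs of 8+ characters from
# FILE and search them case-insensitively for PATTERN
strings_grep() {
    strings -n 8 "$2" | grep -i -- "$1"
}
```

When the file is mostly text with a few stray bytes, `grep -a PATTERN FILE` is a lighter alternative: it tells grep to treat the file as text.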
Lesson
The sequence matters. CPU → memory → disk → network → process internals. Don't jump straight to strace because a process looks hung; check whether the host is simply resource-starved first. Most "application misbehaving" incidents are actually "host under load" incidents, and the application is just the first place the symptoms show up.