Linux Commands I Actually Use When a Box Is Misbehaving

The problem

This isn't a single incident. It's the set of commands I actually reach for when I SSH into a Linux box that's misbehaving and need to understand what's happening before I start changing things. I've grouped them by what question I'm trying to answer.

What's eating CPU

The first thing I want is a live view of where time is going:

top

But raw top output is noisy. I immediately press 1 to show per-CPU usage (which tells me whether it's a single-threaded bottleneck or system-wide load). P sorts by CPU, which is the default, and M switches to sorting by memory. For a cleaner view:

htop

If htop isn't installed, it's usually one apt install htop or yum install htop away, and worth it. The tree view (F5) shows parent-child process relationships, which matters when something is fork-bombing.

To find what's burning CPU without an interactive tool:

ps aux --sort=-%cpu | head -20

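That gives me a header line plus the top 19 processes by CPU. One caveat: the %CPU column in ps is CPU time divided by how long the process has been running, so it's a lifetime average rather than a reading of the last few seconds. If I want to keep an eye on the ranking over time: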

watch -n 2 'ps aux --sort=-%cpu | head -10'
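Because ps reports %CPU as CPU time divided by process lifetime, a long-running process that only recently started spinning can look quiet in that list. A minimal sketch of an interval-based reading, assuming a Linux /proc filesystem; the helper names cpu_ticks and cpu_pct are made up for illustration and not part of any standard tool:

```shell
# Total CPU ticks (user + system) a process has consumed so far.
cpu_ticks() {
  # Drop "pid (comm) " first so spaces in the command name cannot shift
  # the fields; utime and stime are then fields 12 and 13.
  sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# Percentage of one CPU used by PID over INTERVAL seconds (default 1).
cpu_pct() {
  pid=$1
  interval=${2:-1}
  [ -r "/proc/$pid/stat" ] || return 1
  hz=$(getconf CLK_TCK)          # kernel clock ticks per second
  t1=$(cpu_ticks "$pid")
  sleep "$interval"
  t2=$(cpu_ticks "$pid")
  awk -v a="$t1" -v b="$t2" -v hz="$hz" -v s="$interval" \
    'BEGIN { printf "%.1f\n", (b - a) * 100 / (hz * s) }'
}
```

Running cpu_pct <PID> 5 prints the share of one core the process used over the last five seconds; a multi-threaded process can exceed 100.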

If a specific process is the culprit and I want to see which threads inside it are the problem:

top -H -p <PID>

-H shows individual threads, -p scopes it to one process. Useful when a multi-threaded Java or .NET process is misbehaving and you need to know if it's one runaway thread or all of them.
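For a Java process, the usual next step is matching the busy thread ID from top -H to a stack in a thread dump. A small sketch, assuming a JVM whose thread dumps print the native thread ID in hex as nid=0x... (many HotSpot versions do, but check your own dump, since some newer ones print it in decimal); tid_to_nid is a hypothetical helper name:

```shell
# Convert a decimal thread ID (the PID column in `top -H`) to the hex
# form that such thread dumps print after "nid=".
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# tid_to_nid 12345   prints 0x3039
```

Searching the thread dump for nid=0x3039 then lands on the stack of the thread that top flagged.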

What's eating disk

Check overall disk usage first:

df -h

If a filesystem is nearly full, find what's eating it:

du -sh /var/log/* | sort -rh | head -20

Work top-down: start at the root, find the biggest directory, descend into it:

du -sh /* 2>/dev/null | sort -rh | head -10

The 2>/dev/null suppresses permission errors on directories you can't read. Once I've narrowed it down to a folder:

du -sh /var/log/myapp/* | sort -rh | head -10
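The top-down descent can also be scripted so each level doesn't need a fresh command. A rough sketch, assuming GNU du and sort; descend_largest is an illustrative name, and it only follows directories, so a single huge file sitting next to small subdirectories still needs the find below:

```shell
# Follow the largest subdirectory at each level, starting from DIR.
# Prints one "size_in_KiB<TAB>path" line per level.
descend_largest() {
  dir=${1:-/}
  while :; do
    # -x: stay on one filesystem; -d 1: direct children only; -k: KiB.
    # du also prints a total for $dir itself, which awk filters out.
    next=$(du -x -k -d 1 -- "$dir" 2>/dev/null |
      awk -F '\t' -v d="$dir" '$2 != d' |
      sort -rn | head -n 1)
    [ -n "$next" ] || break      # no subdirectories left
    printf '%s\n' "$next"
    dir=$(printf '%s\n' "$next" | cut -f 2-)
  done
}
```

descend_largest /var prints the heaviest path through the tree, one level per line, and stops at a directory with no subdirectories.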

Find files over a specific size:

find /var/log -type f -size +100M -exec ls -lh {} \;

Runaway log files are usually the culprit. If rotation is broken, a single log file can fill a disk in hours. Check when a file was last modified:

ls -lht /var/log/myapp/ | head -20

For active log growth, watch a file's size over time:

watch -n 5 'ls -lh /var/log/myapp/app.log'
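Eyeballing ls output works, but a number is easier to reason about when deciding how long the disk has left. A minimal sketch that measures growth over an interval; growth_rate is a hypothetical helper, and it assumes the interval is a positive whole number of seconds:

```shell
# Average growth of FILE in bytes per second over INTERVAL seconds.
# A negative result means the file shrank, e.g. it was rotated or truncated.
growth_rate() {
  file=$1
  interval=${2:-5}
  before=$(wc -c < "$file") || return 1
  sleep "$interval"
  after=$(wc -c < "$file") || return 1
  echo $(( (after - before) / interval ))
}
```

Dividing the free space from df by this rate gives a rough time-to-full estimate.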

What's eating memory

free -h

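The available column is what matters, not free. Available estimates how much memory can be handed to new processes without swapping, so it includes page cache the kernel can reclaim; free counts only memory that is completely unused.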
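The same figure is exposed as MemAvailable in /proc/meminfo, which is handy for scripts and alerts. A small sketch, assuming a kernel recent enough to report MemAvailable; mem_available_pct is an illustrative name:

```shell
# Percentage of RAM the kernel estimates is available to new processes.
# Takes an optional path so it can be pointed at a saved copy of meminfo.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' \
    "${1:-/proc/meminfo}"
}
```

A single-digit result on a box that is not supposed to be memory-bound is usually worth a closer look.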

For a per-process breakdown:

ps aux --sort=-%mem | head -20

If I suspect a memory leak — a process growing over time — I'll check its current RSS and watch it:

# Current memory for a process
grep VmRSS /proc/<PID>/status

# Watch it over time
watch -n 10 'grep VmRSS /proc/<PID>/status'
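watch only shows the latest value, so slow growth is easy to miss. A sketch that records timestamped samples instead, so the trend can be compared later; sample_rss is a made-up helper name and the defaults are arbitrary:

```shell
# Print "epoch_seconds rss_kb" for PID every INTERVAL seconds, COUNT times.
sample_rss() {
  pid=$1
  interval=${2:-10}
  count=${3:-6}
  i=0
  while [ "$i" -lt "$count" ]; do
    rss=$(awk '/^VmRSS:/ { print $2 }' "/proc/$pid/status" 2>/dev/null)
    if [ -z "$rss" ]; then
      echo "no VmRSS for PID $pid (exited, or a kernel thread?)" >&2
      return 1
    fi
    printf '%s %s\n' "$(date +%s)" "$rss"
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
}
```

Redirecting the output to a file, for example sample_rss <PID> 60 120 > rss.log, gives two hours of data points to look at afterwards.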

If the system is swapping heavily, performance will be terrible even if total memory looks okay:

vmstat 2 5

Sustained non-zero si (swap in) and so (swap out) values mean you're in swap hell. Ignore the first row: it reports averages since boot, not the current interval. Either the workload needs more memory, or there's a leak that needs addressing.
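To pick out just the samples with swap traffic, the vmstat output can be filtered. A minimal sketch, assuming vmstat's default column layout, where si and so are columns 7 and 8; swapping_samples is a hypothetical name:

```shell
# Read `vmstat` output on stdin and print the samples that show swap traffic.
# Skips the two header lines and the first sample (averages since boot),
# and ignores any header lines vmstat repeats later in a long run.
swapping_samples() {
  awk 'NR > 3 && $7 ~ /^[0-9]+$/ && ($7 + $8 > 0) { print "si=" $7, "so=" $8 }'
}
```

Used as vmstat 2 5 | swapping_samples, no output means none of the live samples showed swap activity.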

What the process is actually doing

When a process is hung, slow, or misbehaving and I can't tell why from the outside:

strace -p <PID>

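This shows every system call in real time. Attaching usually needs root (or the same user as the target, depending on the kernel's ptrace restrictions), and tracing slows the process down, so detach with Ctrl-C once you have what you need. If it's stuck in a read() or select() call, it's waiting on I/O or a socket. If it's looping through stat() calls on files, it might be polling a directory endlessly. If no system calls appear at all, it's either stuck in user-space computation or genuinely hung; CPU usage in top tells the two apart.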
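A cheap check that complements strace is the kernel's own view of the process state, which needs no special privileges for your own processes. A small sketch reading it from /proc; proc_state is an illustrative helper name:

```shell
# One-letter scheduler state of PID, from /proc/<PID>/status:
#   R running, S sleeping (interruptible), D uninterruptible sleep
#   (usually waiting on disk or network filesystem I/O), T stopped, Z zombie.
proc_state() {
  awk '/^State:/ { print $2 }' "/proc/$1/status"
}
```

A process that stays in R while strace is silent is most likely spinning in user space; one that stays in D is blocked inside the kernel, typically on I/O.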

For a snapshot of what file descriptors the process has open:

ls -la /proc/<PID>/fd

Open sockets, files being read or written, and pipes are all visible here. Useful for confirming a service is still holding a log file open: if you delete such a file without restarting the service, the disk space isn't freed until the process closes it, and the entry shows up here marked (deleted).
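When df says the disk is full but du can't find the space, deleted-but-open files are the usual suspect. A minimal sketch that lists them for one process, relying on the " (deleted)" suffix the kernel appends to the link target; deleted_open_files is a hypothetical name:

```shell
# List file descriptors of PID that point at files already deleted from disk.
deleted_open_files() {
  for fd in "/proc/$1/fd"/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
      *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
    esac
  done
}
```

Reading another user's process this way needs root, the same as ls -la /proc/<PID>/fd.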

To see what files a process has open in a more readable form:

lsof -p <PID>

Or to find which process is holding a specific file open:

lsof /var/log/myapp/app.log

What's happening on the network

Check what's listening and what's connected:

ss -tlnp   # listening TCP sockets, with process names
ss -tnp    # established TCP connections, with process names

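I prefer ss over netstat: it's faster and ships with iproute2 on modern Linux, whereas netstat comes from the older net-tools package, which is often not installed. Without root, -p shows process names only for sockets owned by your own user.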

For a count of connections by state (useful for spotting connection pool exhaustion or TIME_WAIT buildup):

ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
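When one state dominates, the next question is usually which remote host is responsible. A sketch that groups connections by peer address, assuming the default ss -tn column order where column 5 is the peer address and port; count_by_peer is an illustrative name:

```shell
# Read `ss -tn` output on stdin and count connections per remote address.
count_by_peer() {
  # Skip the header, strip the trailing ":port" from the peer column.
  awk 'NR > 1 { peer = $5; sub(/:[^:]*$/, "", peer); print peer }' |
    sort | uniq -c | sort -rn
}
```

Used as ss -tn | count_by_peer, a single peer holding most of the connections points at one misbehaving client or an exhausted pool on that side.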

If I'm debugging a specific port:

ss -tnp | grep :1521

To watch connections in real time:

watch -n 2 'ss -tnp | grep :8080'

Digging through logs fast

The standard toolkit for log analysis, in order of how often I use it:

# Real-time tail with filtering
tail -f app.log | grep --line-buffered "ERROR"

# Count identical ERROR lines (only groups well if lines don't start with a unique timestamp)
grep "ERROR" app.log | sort | uniq -c | sort -rn | head -20

# Show context around a match (3 lines before, 3 after)
grep -B 3 -A 3 "OutOfMemoryError" app.log

# Search recursively across all log files in a directory
grep -rn "connection refused" /var/log/myapp/

# Extract specific field from structured logs
grep -oP '"error":"[^"]*"' app.log | sort | uniq -c | sort -rn

# Find all log files modified in the last hour
find /var/log -name "*.log" -mmin -60
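Counting with sort and uniq only groups lines that are byte-for-byte identical, so a timestamp at the start of each line makes every error look unique. A minimal sketch that strips it first, assuming lines begin with an ISO-style timestamp such as 2024-05-01 12:00:00,123; both the format and the name count_errors are assumptions to adapt to your own logs:

```shell
# Count ERROR lines after stripping a leading timestamp, so identical
# messages group together. Reads the log on stdin.
count_errors() {
  grep 'ERROR' |
    sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9]{2}:[0-9]{2}:[0-9]{2}[^ ]* +//' |
    sort | uniq -c | sort -rn | head -n 20
}
```

Used as count_errors < app.log. Messages that embed request IDs or other per-line values will still need their own pattern.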

For large CLOBs or files with embedded binary data, where grep reports only that a binary file matches instead of printing the line, strings is occasionally useful:

strings suspicious_file | grep -iE "error|exception|failed"

Lesson

The sequence matters: CPU → memory → disk → network → process internals. Don't jump straight to strace because a process looks hung; check whether the host is simply resource-starved first. Most "application misbehaving" incidents are actually "host under load" incidents where the application is just the first place the symptoms surface.
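That order can be captured as a first-pass snapshot to run before touching anything. A rough sketch built from the commands above; triage is a made-up function name, and every step is best-effort so a missing tool is skipped rather than fatal:

```shell
# First-pass snapshot in the order CPU, memory, disk, network.
triage() {
  echo '== CPU =='
  cat /proc/loadavg 2>/dev/null || true
  ps aux --sort=-%cpu 2>/dev/null | head -n 6 || true
  echo '== Memory =='
  free -h 2>/dev/null || true
  echo '== Disk =='
  df -h 2>/dev/null || true
  echo '== Network =='
  ss -tan 2>/dev/null | awk 'NR > 1 { print $1 }' | sort | uniq -c | sort -rn || true
  return 0
}
```

Only if all four sections look healthy is it worth moving on to strace and lsof on the individual process.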