Build a distributed logging stack (ELK / Loki) (12 scenes)
Scene 01 · grep + ssh stops working
Once your fleet has more than a handful of hosts, the only way to answer 'which host saw this error' is to ship lines centrally — local files plus ssh is an O(N) dead end.
Diagram
Left: a developer laptop running `grep "connection refused"`. From it, ssh arrows fan out to a grid of host boxes — 3 large tiles, then a 5x6 grid, then a dense 15x20 dot field as the slider moves. Each host shows /var/log/app.log with its rotated siblings stacked underneath (app.log, app.log.1, app.log.2.gz dimmed) and a 'last write' timestamp. Bottom: a wall-clock 'time to answer' meter for the query 'find all ERRORs in the last hour'. At the largest fleet size, a faint cloud arrow leaves the right edge of the diagram — a placeholder for somewhere these lines could be sent.
DEV LAPTOP
$ for h in $hosts; do
    ssh $h \
      grep 'ERROR' /var/log/app.log
  done
# find ERRORs in last hour
RESPONSE
ssh fan-out: ~0.4s
hosts queried: 3
FLEET · 3 hosts
host-001 · last write 12:04:21 · /var/log/app.log, app.log.1, app.log.2.gz
host-002 · last write 12:04:19 · /var/log/app.log, app.log.1, app.log.2.gz
host-003 · last write 12:04:22 · /var/log/app.log, app.log.1, app.log.2.gz
TIME TO ANSWER (scale: 0ms, 10s, 60s)
3 hosts. ssh + grep answers in seconds: slow, but it works.
Three hosts. You suspect a connection error so you ssh to each one in turn and grep for 'connection refused'. The time-to-answer meter at the bottom is the wall clock you're spending. Notice the rotated files stacked under each app.log — yesterday's data is already gzipped.
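The meter's growth can be sketched with simple arithmetic. This is a minimal model, assuming a strictly serial walk in which every host costs the same fixed round trip. `PER_HOST_SECONDS` is a hypothetical figure chosen for illustration, not a measurement. The fleet sizes are the three slider positions described in the diagram.

```python
# Hypothetical per-host cost (connect + grep + transfer); not a measurement.
PER_HOST_SECONDS = 0.4


def serial_time_to_answer(n_hosts: int, per_host: float = PER_HOST_SECONDS) -> float:
    """Wall-clock seconds for a one-host-at-a-time ssh + grep walk."""
    return n_hosts * per_host


# Fleet sizes from the diagram: 3 tiles, a 5x6 grid, a 15x20 dot field.
FLEET_SIZES = [3, 5 * 6, 15 * 20]

if __name__ == "__main__":
    for n in FLEET_SIZES:
        print(f"{n:>4} hosts -> {serial_time_to_answer(n):6.1f} s")
```

Under this assumption the cost is linear in fleet size, so a hundredfold larger fleet takes a hundred times longer to answer the same question.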
Implementation
Engineer.find_error
the on-call's serial ssh-grep walk across the fleet
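# you are the loop body
import subprocess

CMD = "grep -h ERROR /var/log/app.log /var/log/app.log.1 2>/dev/null"

for host in hosts:  # O(N) in fleet size; `hosts` is your inventory list
    proc = subprocess.run(['ssh', host, CMD], capture_output=True, text=True)
    # ssh exits 255 on its own failures; grep's exit 1 only means "no match"
    if proc.returncode == 255:
        print(host, 'unreachable')  # debugging the debugger
        continue
    for line in proc.stdout.splitlines():
        print(host, line)  # may be empty: file rotated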
Logrotate.rotate
the cron-driven cycle that evicts yesterday's file
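# models what /etc/logrotate.d/app does when cron runs it nightly
import gzip, os, shutil, signal

def rotate(path='/var/log/app.log', keep=3, app_pid=None):
    # shift the chain, oldest first: app.log.3.gz -> app.log.4.gz, etc.
    for i in range(keep, 0, -1):
        if os.path.exists(f'{path}.{i}.gz'):
            os.replace(f'{path}.{i}.gz', f'{path}.{i+1}.gz')
    # yesterday's file becomes app.log.1
    os.replace(path, f'{path}.1')
    open(path, 'a').close()              # app starts fresh
    # signal before compressing, or lines written to the old inode are lost
    # (a short race remains; real logrotate offers delaycompress for it)
    if app_pid is not None:
        os.kill(app_pid, signal.SIGHUP)  # reopen log file
    with open(f'{path}.1', 'rb') as src, gzip.open(f'{path}.1.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)     # -> app.log.1.gz
    os.remove(f'{path}.1')
    # anything past `keep` is deleted from local disk
    if os.path.exists(f'{path}.{keep+1}.gz'):
        os.remove(f'{path}.{keep+1}.gz')  # silent eviction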
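To make the silent eviction concrete, here is a small in-memory sketch of the same naming chain. It touches no real files: `simulate_retention` is a hypothetical helper that maps each file name to the day its lines were written, assuming one rotation per night and immediate compression as in the rotate listing.

```python
def simulate_retention(nights: int, keep: int = 3) -> dict[str, int]:
    """Return {file name: day written} after `nights` nightly rotations."""
    disk: dict[str, int] = {"app.log": 0}  # day 0 is being appended to
    for night in range(1, nights + 1):
        # Shift the chain, oldest first, so no file is overwritten.
        for i in range(keep, 0, -1):
            old = f"app.log.{i}.gz"
            if old in disk:
                disk[f"app.log.{i + 1}.gz"] = disk.pop(old)
        disk["app.log.1.gz"] = disk.pop("app.log")  # yesterday, compressed
        disk["app.log"] = night                     # fresh file for the new day
        disk.pop(f"app.log.{keep + 1}.gz", None)    # silent eviction
    return disk


if __name__ == "__main__":
    for name, day in sorted(simulate_retention(5).items()):
        print(f"{name:<14} holds day {day}")
```

After five nights with `keep=3`, the lines written on days 0 and 1 are no longer anywhere on the host, and nothing reported their removal.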
Host.serve_log
how one line ends up in app.log and later in app.log.N.gz
# the application is the only writer; logrotate is the only mover
def emit(line):
    # append, then close right away so no handle outlives a rotation
    with open('/var/log/app.log', 'a') as f:
        f.write(line + '\n')

# on-disk layout after a few days of rotation:
#   /var/log/app.log       <- today, being appended
#   /var/log/app.log.1.gz  <- yesterday, compressed
#   /var/log/app.log.2.gz  <- 2 days ago
#   /var/log/app.log.3.gz  <- 3 days ago (last kept)
# anything older was rm'd by Logrotate.rotate and is gone from disk.
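Given that layout, a plain grep on app.log only sees today's lines. The sketch below shows one way to search the live file and its compressed siblings together on a single host. `grep_rotated` is a hypothetical helper name, and the sketch assumes rotated files end in `.gz` and contain UTF-8 text.

```python
import gzip
import re
from pathlib import Path


def grep_rotated(log_dir: str, pattern: str, base: str = "app.log") -> list[tuple[str, str]]:
    """Return (file name, line) pairs matching `pattern` in `base` and its rotated siblings."""
    regex = re.compile(pattern)
    hits: list[tuple[str, str]] = []
    for path in sorted(Path(log_dir).glob(base + "*")):
        # Rotated files are gzip-compressed; the live file is plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if regex.search(line):
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

Even with this helper, the search reaches only what is still on disk, and it still has to run once per host.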