Build a distributed logging stack (ELK / Loki) (12 scenes)
Scene 01 · grep + ssh stops working
Once your fleet has more than a handful of hosts, the only way to answer 'which host saw this error' is to ship lines centrally — local files plus ssh is an O(N) dead end.
Diagram
Left: a developer laptop running `grep "connection refused"`. From it, ssh arrows fan out to a grid of host boxes — 3 large tiles, then a 5x6 grid, then a dense 15x20 dot field as the slider moves. Each host shows /var/log/app.log with its rotated siblings stacked underneath (app.log, app.log.1, app.log.2.gz dimmed) and a 'last write' timestamp. Bottom: a wall-clock 'time to answer' meter for the query 'find all ERRORs in the last hour'. At the largest fleet size, a faint cloud arrow leaves the right edge of the diagram — a placeholder for somewhere these lines could be sent.
Three hosts. You suspect a connection error, so you ssh to each one in turn and grep for 'connection refused'. The time-to-answer meter at the bottom is the wall-clock time you are spending. Notice the rotated files stacked under each app.log: yesterday's data is already gzipped, so a plain grep will not read it.
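To put a rough number on the time-to-answer meter, the serial walk can be modelled as hosts multiplied by the time spent on each host. The sketch below uses the three fleet sizes from the diagram (3, 5x6 = 30, 15x20 = 300). The 4-second per-host figure and the names `serial_seconds` and `PER_HOST_SECONDS` are assumptions for illustration, not measurements.

```python
# Hypothetical cost of one ssh + grep round trip, in seconds (an assumption).
PER_HOST_SECONDS = 4.0

# Fleet sizes taken from the diagram: 3 tiles, a 5x6 grid, a 15x20 dot field.
FLEET_SIZES = [3, 5 * 6, 15 * 20]


def serial_seconds(hosts: int, per_host_seconds: float = PER_HOST_SECONDS) -> float:
    """Wall-clock time for a one-host-at-a-time walk: linear in fleet size."""
    return hosts * per_host_seconds


if __name__ == "__main__":
    for n in FLEET_SIZES:
        total = serial_seconds(n)
        print(f"{n:>4} hosts: {total:>7.0f} s ({total / 60:.1f} min)")
```

Under that assumed figure, the 300-host fleet takes 20 minutes to answer a single query, and the cost grows again with every host added.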
Implementation
Engineer.find_error
the on-call's serial ssh-grep walk across the fleet
# you are the loop body
for host in hosts:  # O(N) in fleet size
    out = ssh(host,
              'grep -h ERROR /var/log/app.log'
              ' /var/log/app.log.1'
              ' 2>/dev/null')
    # ssh exits 255 when the connection itself fails; grep's own
    # exit 1 (no match) or 2 (missing file) is not 'unreachable'
    if ssh.exit_code == 255:
        print(host, 'unreachable')  # debugging the debugger
        continue
    for line in out.splitlines():
        print(host, line)  # may be empty: file rotated
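A more concrete version of the walk might look like the sketch below, assuming an OpenSSH client on the PATH and key-based login. The helper names `classify`, `run_remote_grep`, and `find_error` are hypothetical. The point is the exit-status handling: ssh itself returns 255 on a connection failure, while grep returns 1 for no match and 2 when a file is missing, even if another file matched.

```python
import subprocess

SSH_FAILED = 255  # OpenSSH's exit status when ssh itself fails
REMOTE_CMD = "grep -h ERROR /var/log/app.log /var/log/app.log.1 2>/dev/null"


def classify(exit_code: int, stdout: str) -> tuple[str, list[str]]:
    """Map one host's result to ('unreachable' | 'matches' | 'no-matches', lines)."""
    if exit_code == SSH_FAILED:
        return "unreachable", []
    # grep exits 2 when a file is missing even if another file matched,
    # so trust the captured lines rather than the exit status here.
    lines = stdout.splitlines()
    return ("matches", lines) if lines else ("no-matches", [])


def run_remote_grep(host: str, timeout_s: float = 10.0) -> tuple[str, list[str]]:
    """Run the grep on one host over ssh; a timeout counts as unreachable."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, REMOTE_CMD],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "unreachable", []
    return classify(proc.returncode, proc.stdout)


def find_error(hosts: list[str]) -> None:
    """The serial walk: one ssh round trip per host, O(N) in fleet size."""
    for host in hosts:
        status, lines = run_remote_grep(host)
        if status == "unreachable":
            print(host, "unreachable")
            continue
        for line in lines:
            print(host, line)
```

Even with the exit codes untangled, an empty result stays ambiguous: it can mean no errors, or that the lines were already rotated into a .gz file this grep never opened.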
Logrotate.rotate
the cron-driven cycle that evicts yesterday's file
# /etc/logrotate.d/app, run nightly by cron
def rotate(path='/var/log/app.log', keep=3):
    # shift the chain: app.log.2.gz -> app.log.3.gz, etc.
    for i in range(keep, 0, -1):
        mv(f'{path}.{i}.gz', f'{path}.{i+1}.gz')
    # yesterday's file becomes app.log.1, then gets gzipped
    mv(path, f'{path}.1')
    gzip(f'{path}.1')  # -> app.log.1.gz
    truncate(path)  # app starts fresh
    # anything past `keep` is deleted from local disk
    rm(f'{path}.{keep+1}.gz')  # silent eviction
    signal(app_pid, SIGHUP)  # reopen log file
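The eviction is easier to see when the cycle runs against real files. The following is a minimal simulation of the same shift-compress-evict sequence in a temporary directory, not logrotate itself. The `simulate` helper and the one-line-per-day log content are assumptions made for the demonstration, and the SIGHUP step is omitted because no application is running.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def rotate(path: Path, keep: int = 3) -> None:
    """Shift the .N.gz chain up by one, compress the live file, evict the oldest."""
    for i in range(keep, 0, -1):
        src = Path(f"{path}.{i}.gz")
        if src.exists():
            src.replace(Path(f"{path}.{i + 1}.gz"))
    if path.exists():
        with path.open("rb") as raw, gzip.open(f"{path}.1.gz", "wb") as packed:
            shutil.copyfileobj(raw, packed)
    path.write_text("")  # the app starts the new day with an empty file
    Path(f"{path}.{keep + 1}.gz").unlink(missing_ok=True)  # silent eviction


def simulate(days: int, keep: int = 3) -> dict[str, str]:
    """Write one line per day, rotate nightly, return what is left on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "app.log"
        for day in range(1, days + 1):
            with log.open("a") as f:
                f.write(f"day {day} ERROR connection refused\n")
            rotate(log, keep)
        survivors = {}
        for p in sorted(Path(tmp).iterdir()):
            if p.suffix == ".gz":
                with gzip.open(p, "rt") as f:
                    survivors[p.name] = f.read().strip()
            else:
                survivors[p.name] = p.read_text().strip()
        return survivors


if __name__ == "__main__":
    for name, content in simulate(days=5).items():
        print(name, "->", content or "(empty)")
```

After five simulated days with `keep=3`, only days 3 to 5 remain on disk; the lines from days 1 and 2 were deleted without any error being raised.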
Host.serve_log
how one line ends up in app.log and later in app.log.N.gz
# the application is the only writer; logrotate is the only mover
def emit(line):
    with open('/var/log/app.log', 'a') as f:
        f.write(line + '\n')

# on-disk layout after a few days of rotation:
#   /var/log/app.log       <- today, being appended
#   /var/log/app.log.1.gz  <- yesterday, compressed
#   /var/log/app.log.2.gz  <- 2 days ago
#   /var/log/app.log.3.gz  <- 3 days ago (last kept)
# anything older was rm'd by Logrotate.rotate and is gone from disk.
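Given that layout, a search that only opens app.log misses everything already compressed. One way to cover both forms on a single host is sketched below. The names `search_logs` and `demo` are hypothetical, and the demo builds its own sample files in a temporary directory instead of touching /var/log.

```python
import gzip
import tempfile
from pathlib import Path


def search_logs(log_dir: Path, needle: str, stem: str = "app.log") -> list[tuple[str, str]]:
    """Return (filename, line) pairs for every line containing `needle`."""
    hits = []
    for p in sorted(log_dir.glob(f"{stem}*")):
        # Rotated siblings are gzip-compressed; the live file is plain text.
        opener = gzip.open if p.suffix == ".gz" else open
        with opener(p, "rt") as f:
            for line in f:
                if needle in line:
                    hits.append((p.name, line.rstrip("\n")))
    return hits


def demo() -> list[tuple[str, str]]:
    """Build a two-file sample layout and search it."""
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "app.log").write_text("INFO started\nERROR connection refused\n")
        with gzip.open(d / "app.log.1.gz", "wt") as f:
            f.write("ERROR connection refused (yesterday)\n")
        return search_logs(d, "ERROR")


if __name__ == "__main__":
    for name, line in demo():
        print(name, line)
```

This only recovers what is still on the host. Lines in files that rotation has already deleted cannot be found by any local search.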