06.14
so i had to find a new host for the podcast thanks to dreamhost being complete dicks and terminating my entire account. (sigh)
so i decided that since the podcast was unavailable for a few days (hope no one noticed), it would be a good time to analyze day 1 web traffic in a fun and interesting way.
pretty much every web server logs requests to a file. each request is a single line in the logfile that looks something like this:
127.0.0.1 - - [04/Jun/2014:21:00:48 -0700] "GET /ARCHIVE_-_0000-00-00_-_Show_Name_With_DJ_Name_From_Location.mp3 HTTP/1.1" 200 140398275 "-" "iTunes/11.2.1 (Macintosh; OS X 10.9.3) AppleWebKit/537.75.14"
there is a log entry for every whole or partial content request. every time you visit a website, every image you see and every js, css, etc. file that loads gets its own entry. on busy sites this log file can become huge. the log file i used for this demonstration has 22,577 lines. each line (see above) can tell you a few things:
the ip address of the person making the request
time stamp
request type and request
http response code
bytes transferred
user agent
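if you want to pull those fields out yourself, here's a minimal python sketch. it assumes the line is in apache's "combined" format like the sample above. the names LOG_PATTERN and parse_line are made up for this example, and lines that don't match (weird or broken requests) just come back as None.

```python
import re
from datetime import datetime

# one named group per field listed above (plus the referer, which is "-" in the sample)
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '                                # ip address, identd, auth user
    r'\[(?P<time>[^\]]+)\] '                               # time stamp
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '  # request type and request
    r'(?P<status>\d{3}) (?P<size>\d+|-) '                  # http response code, bytes transferred
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'             # referer, user agent
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # e.g. 04/Jun/2014:21:00:48 -0700 (month names assume an english/C locale)
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # apache writes "-" when no body bytes were sent
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


SAMPLE = (
    '127.0.0.1 - - [04/Jun/2014:21:00:48 -0700] '
    '"GET /ARCHIVE_-_0000-00-00_-_Show_Name_With_DJ_Name_From_Location.mp3 HTTP/1.1" '
    '200 140398275 "-" '
    '"iTunes/11.2.1 (Macintosh; OS X 10.9.3) AppleWebKit/537.75.14"'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

the regex is deliberately strict, so anything it can't parse gets skipped instead of guessed at.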
unfortunately, the new hosting provider only allows access to the raw apache logs for the current day. i learned this while emailing support for a few hours. apparently they rotate the logs at 9pm pst every day… and guess what, it's not really rotated (sigh). once 9pm rolls around, the log file is deleted/emptied. so there's only 1 log file. no historical data whatsoever. i guess it's because i configured the vhost after 9pm on june 3rd, thus never creating a file in my home directory. so as you can see, i missed out. the data that i REALLY, REALLY wanted to have, which was the FIRST 24 hours, 86’d.
ok.. so now.. set a crontab to grab the file just before they delete it, otherwise you lose, and it's gone forever.
55 20 * * * scp user@host:~/logs/podcast.access.log /var/log/apache2/podcast.access.log >> /dev/null 2>&1
and now, with a wonderful tool called logstalgia combined with ffmpeg and some command line-fu, turn the log into a video:
cat /var/log/apache2/podcast.access.log | logstalgia -1920x1080 -g Archives,ARCHIVE,99 --paddle-mode pid --update-rate 1 --output-framerate 60 --output-ppm-stream - - | ffmpeg -f image2pipe -r 60 -c:v ppm -s 1920x1080 -pix_fmt yuv420p -i - -crf 1 -c:v h264 -pix_fmt yuv420p -f mp4 outfile.mp4
and if you want real time…
tail -F -q /var/log/apache2/podcast.access.log | logstalgia -1920x1080 -g Archives,ARCHIVE,99 --paddle-mode pid --update-rate 1 --output-framerate 60 --output-ppm-stream - - | ffmpeg -f image2pipe -r 60 -c:v ppm -s 1920x1080 -pix_fmt yuv420p -i - -crf 1 -c:v h264 -pix_fmt yuv420p -f flv rtmp://videoservice.com/live/secretkey
i will also add that due to storage limits on the new host, only the archives from 2013 Q4 and this year (2014) are on the podcast. this would be much different otherwise =]
and.. the timescale is non-linear! (because i like to fiddle)
Date | Hits | Bandwidth
6/04 | 20,034 | 2365.35 GB (2.3TB)
6/05 | 12,702 | 1528.78 GB (1.5TB)
Total (2 days) | 32,736 | 3894.13 GB (3.8TB)
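numbers like these can be tallied straight from the log. below is a rough python sketch, not necessarily how the table above was made: it counts lines per calendar date and sums the bytes-transferred field. the name daily_totals is made up for the example, and it assumes the same combined log format as the sample line. note the bytes field only covers the response body, so real bandwidth is a bit higher.

```python
import re
from collections import defaultdict

# grabs the date out of [04/Jun/2014:21:00:48 -0700] and the byte count after the status code
LINE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]*\] "[^"]*" \d{3} (\d+|-)')


def daily_totals(lines):
    """Return {date: (hits, bytes)} for an iterable of access log lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue  # skip anything that isn't a normal request line
        date, size = match.groups()
        totals[date][0] += 1
        totals[date][1] += 0 if size == "-" else int(size)
    return {date: tuple(counts) for date, counts in totals.items()}


if __name__ == "__main__":
    # same path the crontab above copies the log to
    with open("/var/log/apache2/podcast.access.log") as logfile:
        for date, (hits, size) in daily_totals(logfile).items():
            print(f"{date} | {hits:,} | {size / 1e9:.2f} GB")
```

the dates here come from the timestamp in each line, so a "day" is a calendar day in the server's timezone, not the host's 9pm to 9pm window.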
and here's one using logs from a more well established instance of apache.