Selectively clear CDN cache with rsync

I have used a few different CDNs for my static site – Cloudflare, Fastly, and (most recently) BunnyCDN. My site is made in Jekyll and is built/deployed automatically using GitHub Actions. When files are changed, the CDN needs to be informed that the files have been updated. Otherwise, old files will be served until they expire from the cache.

The easy/obvious thing to do is to wipe the entire zone cache on each update, but this is very inefficient in my case. Each update typically changes just one file.

Because the site is built from scratch on a clean CI environment each time, there’s no obvious way to determine which files have been changed. Even if this were not the case, Jekyll resets all timestamps to the current time anyway.

My solution to this problem is to use logs from rsync to find the updated files. To make this work, we need to tell rsync to use checksums for finding differences. Otherwise, it will use a timestamp rule and copy all files. In essence, we are using the endpoint as a database of old file checksums.
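
To see the difference the checksum rule makes, here is a throwaway sanity check (the file name and temp directories are just for illustration):

```shell
#!/bin/sh
# Two directories holding an identical file, but with mismatched mtimes --
# exactly what a fresh Jekyll build produces.
SRC=$(mktemp -d)
DST=$(mktemp -d)
echo "hello" > "$SRC/page.html"
cp "$SRC/page.html" "$DST/page.html"
touch -t 202201010000 "$DST/page.html"   # pretend the endpoint copy is old

# Default quick-check (size + mtime): page.html is reported as changed
rsync -rli --dry-run "$SRC/" "$DST/"

# Checksum rule (-c): identical content, so page.html is not reported
rsync -rlci --dry-run "$SRC/" "$DST/"

rm -rf "$SRC" "$DST"
```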

The basic outline to do this is:

#!/bin/sh

# Create a tempfile for the rsync log
TMPFILE=$(mktemp /tmp/deploy.XXXXXX)

echo "Copying www files to endpoint"
rsync -rlci --log-file="$TMPFILE" --delete "$SOURCE" "$DEST"

# Purge the CDN using values from the rsync log
echo "Purging updated files from CDN:"
grep -E '\] (<fc)' "$TMPFILE" | cut -d ' ' -f 5 | sed -e 's/index.html$//' | while read -r purgepath; do
    echo "Purge /$purgepath"
    # CURL COMMAND TO PURGE https://domain.tld/$purgepath
done

rm -f "$TMPFILE"

The above script parses the rsync log to find changed files. To demystify the script, we need to see what an rsync log looks like when we enable checksums:

2022/08/12 15:55:15 [1978] building file list
2022/08/12 15:55:15 [1978] <fc.T...... 404.html
2022/08/12 15:55:15 [1978] <fc.T...... feed.xml
2022/08/12 15:55:15 [1978] <fc.T...... index.html
2022/08/12 15:55:15 [1978] <fc.T...... sitemap.xml
2022/08/12 15:55:15 [1978] <fc.T...... version.txt
2022/08/12 15:55:15 [1978] <fc.T...... blog/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/2/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/3/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/covert-discrimination-tullock/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/econ-ipsum-tex/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/functional-optimization/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/fund-open-source/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/induction-on-reals/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/list-of-econ-blogs/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/optimal-fair-contests/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/r-fix-onedrive/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/screen/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/stop-email-scraping/index.html
2022/08/12 15:55:15 [1978] <fc.T...... blog/war-of-attrition-complete-information/index.html
2022/08/12 15:55:15 [1978] <fc.T...... fancyindex/header.html
2022/08/12 15:55:15 [1978] <fc.T...... mirrors/index.html
2022/08/12 15:55:15 [1978] <fc.T...... papers/asymmetric-all-pay-auctions-with-spillovers/index.html
2022/08/12 15:55:15 [1978] <fc.T...... papers/covert-discrimination-in-all-pay-contests/index.html
2022/08/12 15:55:15 [1978] sent 35764 bytes  received 16916 bytes  total size 4850448

Note that a modified file that is copied to the destination is marked with <fc: the f means a regular file, and the c means its checksum differs. So the first pipe in our script keeps only these lines with grep -E '\] (<fc)'. You could broaden the pattern, e.g. grep -E '\] (<fc|<f\+)' to also catch newly created files, but that makes no sense for this purpose: a new file was never in the cache, so there is nothing to purge.

We then use cut -d ' ' -f 5 to keep only the path on each line (the fifth space-delimited field). What we now have is a list of modified files:

404.html
feed.xml
index.html
sitemap.xml
version.txt
blog/index.html
blog/2/index.html
blog/3/index.html
blog/covert-discrimination-tullock/index.html
blog/econ-ipsum-tex/index.html
blog/functional-optimization/index.html
blog/fund-open-source/index.html
blog/induction-on-reals/index.html
blog/list-of-econ-blogs/index.html
blog/optimal-fair-contests/index.html
blog/r-fix-onedrive/index.html
blog/screen/index.html
blog/stop-email-scraping/index.html
blog/war-of-attrition-complete-information/index.html
fancyindex/header.html
mirrors/index.html
papers/asymmetric-all-pay-auctions-with-spillovers/index.html
papers/covert-discrimination-in-all-pay-contests/index.html

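You can check the grep/cut stage in isolation by feeding it a sample of the log, one matching and one non-matching line:

```shell
#!/bin/sh
# Only the <fc line survives grep; cut then keeps the path field.
printf '%s\n' \
    '2022/08/12 15:55:15 [1978] building file list' \
    '2022/08/12 15:55:15 [1978] <fc.T...... blog/index.html' \
    | grep -E '\] (<fc)' | cut -d ' ' -f 5
# prints: blog/index.html
```
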
We probably don’t want to purge /index.html from the cache; no one accesses this file directly. We want to purge /. The next command, sed -e 's/index.html$//', deletes a trailing index.html from each path (the $ anchors the match to the end of the line).

404.html
feed.xml

sitemap.xml
version.txt
blog/
blog/2/
blog/3/
blog/covert-discrimination-tullock/
blog/econ-ipsum-tex/
blog/functional-optimization/
blog/fund-open-source/
blog/induction-on-reals/
blog/list-of-econ-blogs/
blog/optimal-fair-contests/
blog/r-fix-onedrive/
blog/screen/
blog/stop-email-scraping/
blog/war-of-attrition-complete-information/
fancyindex/header.html
mirrors/
papers/asymmetric-all-pay-auctions-with-spillovers/
papers/covert-discrimination-in-all-pay-contests/

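A quick check of the substitution on a few representative paths:

```shell
#!/bin/sh
# The $ anchor means only a *trailing* index.html is removed;
# other paths pass through untouched.
printf '%s\n' 'blog/index.html' 'feed.xml' 'index.html' \
    | sed -e 's/index.html$//'
# prints:
# blog/
# feed.xml
# (and an empty line for the root index.html, i.e. /)
```

One caveat: a file whose name merely ends in index.html (say, a hypothetical myindex.html) would also be trimmed, to my. Anchoring on the slash, s|/index.html$|/|, avoids that, at the cost of having to handle the root index.html separately.
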
All that’s left to do is iterate down the list and clear each line from the CDN cache. I do this with cURL. For BunnyCDN, I run

curl --get --url 'https://api.bunny.net/purge' \
     --header 'Accept: application/json' \
     --header "AccessKey: $1" \
     --data-urlencode "url=https://$PULL_ZONE/$purgepath"

but you can find an example for your CDN using its API documentation.


What do you think? There are probably a lot of edge cases that this does not handle (e.g. spaces in filenames), but it works well for me. You can see an example of this in my GitHub Actions workflow.
