I’ve recently been thinking about how best to archive images and other external assets from emails and webpages for later analysis and review. I’ve accumulated many HTML-based emails, and it occurred to me that if I don’t archive the images these emails reference, they may eventually become unretrievable. In other words, the links and images may suffer from link rot.
Examples of JPEG Images from ~5,000 emails
To avoid this, I will archive and store all of the referenced images and other assets (CSS, SVGs). The Assetgraph module is quite helpful with this task, but I first had to teach it how to parse MIME-based email messages with inline attachments.
Rather than processing every email, I’d like to estimate the space required to store all of the referenced assets (images, CSS, SVGs).
What is Content-Addressable Storage?
An asset may be referenced more than once in the same email or used in more than one email, and storing the same content repeatedly would be wasteful. Holding the assets in content-addressable storage ensures that each piece of content is stored only once. Being content-addressable means the content itself dictates the location or address for storage. For example, consider a JPEG image: traditionally, the storage filename would result from applying the MD5 hash function to the URL. With content-addressable storage, the storage location is instead determined by the output of MD5 or another hash function applied to the content retrieved from the URL. The content stored at a given address never changes. The trade-off is that content, once added, can never be removed, since there is no reference tracking to know which content is no longer referenced.
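To make the idea concrete, here is a minimal sketch of a content-addressable store on the local filesystem. It is an illustration under my own assumptions, not the code used for this analysis: the function names `content_key` and `store`, the two-level directory fan-out, and the choice of SHA-256 rather than MD5 are all hypothetical.

```python
import hashlib
from pathlib import Path


def content_key(data: bytes) -> str:
    """Return the storage key for a blob: the hex digest of its content."""
    # SHA-256 is an assumption here; any stable hash function would work.
    return hashlib.sha256(data).hexdigest()


def store(root: Path, data: bytes) -> Path:
    """Write data under a path derived from its hash; skip if already present."""
    key = content_key(data)
    # Fan out into subdirectories so no single directory grows too large.
    path = root / key[:2] / key[2:4] / key
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path
```

Because the path depends only on the bytes, the same image fetched from two different URLs lands at the same location and is written once.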
What types of URLs are referenced in emails?
Image-related MIME types are the most commonly referenced URLs. This chart compares the largest image MIME types in the sample:
Histograms of File Size by File Type
It is apparent that most files are small, but there is a long tail of much larger files. I truncated the x-axis on these charts to better highlight the distribution.
Extrapolating to the Entire Population
The sample is just 5,071 emails. It is useful to extrapolate to the entire population of 1,000,000 to estimate the storage costs.
Content Type | Sample (Gigabytes) | Estimated Population (Gigabytes) |
---|---|---|
image/jpeg | 3.21 | 633.14 |
image/gif | 0.95 | 187.44 |
image/png | 0.84 | 165.17 |
image/webp | 0.31 | 61.04 |
Total | 5.31 | 1,046.79 |
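The extrapolation can be sketched as a simple linear scaling by the number of emails. This assumes the sample is representative of the whole population; the variable and function names are my own. Because the inputs below are the rounded sample figures from the table, the results differ from the table's population column in the first decimal place.

```python
SAMPLE_EMAILS = 5_071
POPULATION_EMAILS = 1_000_000

# Sample sizes in gigabytes, as rounded in the table above.
sample_gb = {
    "image/jpeg": 3.21,
    "image/gif": 0.95,
    "image/png": 0.84,
    "image/webp": 0.31,
}


def extrapolate(gigabytes: float) -> float:
    """Scale a sample total linearly to the full population of emails."""
    return gigabytes * POPULATION_EMAILS / SAMPLE_EMAILS


estimates = {mime: extrapolate(gb) for mime, gb in sample_gb.items()}
for mime, gb in estimates.items():
    print(f"{mime:<11} {gb:8.2f} GB")
print(f"{'Total':<11} {sum(estimates.values()):8.2f} GB")
```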
Storing a gigabyte at AWS costs about $0.023 per month in the USA.
The monthly cost to store all of the resources is: $24.07. So far so good.
AWS charges $0.09 per GB to transfer data out, so to retrieve all of the resources it would cost $94.21.
The per-request charges to S3 will also add up. With each email referencing about 12 resources, storing 1,000,000 emails will result in 12,000,000 calls to S3’s PutObject method. Since PutObject requests cost $0.005 per 1,000 API calls, it would cost $60.00 to make all of them.
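The three cost figures above follow from straightforward arithmetic, sketched here using only the prices and counts quoted in this section. The constant names are my own, and prices will vary by region and over time.

```python
TOTAL_GB = 1_046.79            # estimated population size from the table
STORAGE_PER_GB_MONTH = 0.023   # USD per gigabyte per month
TRANSFER_OUT_PER_GB = 0.09     # USD per gigabyte transferred out
PUT_PER_1000 = 0.005           # USD per 1,000 PutObject requests
EMAILS = 1_000_000
RESOURCES_PER_EMAIL = 12       # approximate average from the sample

monthly_storage = TOTAL_GB * STORAGE_PER_GB_MONTH
transfer_out = TOTAL_GB * TRANSFER_OUT_PER_GB
put_requests = EMAILS * RESOURCES_PER_EMAIL
put_cost = put_requests / 1000 * PUT_PER_1000

print(f"Monthly storage:  ${monthly_storage:,.2f}")
print(f"Full retrieval:   ${transfer_out:,.2f}")
print(f"PutObject calls:  {put_requests:,} costing ${put_cost:,.2f}")
```

Storage is a recurring monthly charge, while the transfer-out and PutObject costs are paid once per full retrieval and once per full upload, respectively.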
Evaluating WebP
The WebP image format is increasingly available across platforms. It is exciting after about a decade of waiting for broad adoption.
WebP offers both lossy and lossless compression algorithms; this means that it is a candidate to recompress both the JPEG and the PNG files, respectively.
PNG conversion to WebP
Compression Option | Compression Ratio |
---|---|
Near Lossless q=80 | 0.5066 |
Near Lossless q=90 | 0.5067 |
Lossless | 0.6014 |
JPEG conversion to WebP
Compression Option | Compression Ratio |
---|---|
Lossy q=75 | 0.2451 |
Lossy q=80 | 0.2955 |
Lossy q=85 | 0.3592 |
Lossy q=90 | 0.4607 |
Lossy q=95 | 0.6310 |
What could explain the large size reductions for the JPEG files? It seems that JPEGs in the sample often carry large embedded color profile blocks. Publishers of the JPEG files may also be unaware of how large the images are; they may assume that downscaling and optimization occur further down the pipeline.
There appears to be a non-linear relationship between the quality selected for compression and the output size produced.
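One quick way to see the non-linearity is to look at how much the ratio grows for each 5-point increase in quality, using the values from the JPEG table above. This is only a sketch over five data points, and it treats the compression ratio as output size divided by input size.

```python
# Compression ratios from the JPEG-to-WebP table (output size / input size).
ratios = {75: 0.2451, 80: 0.2955, 85: 0.3592, 90: 0.4607, 95: 0.6310}

qualities = sorted(ratios)
# Growth in ratio for each successive 5-point step in quality.
steps = [ratios[hi] - ratios[lo] for lo, hi in zip(qualities, qualities[1:])]

for (lo, hi), step in zip(zip(qualities, qualities[1:]), steps):
    print(f"q={lo} -> q={hi}: ratio grows by {step:.4f}")
```

Each step is larger than the one before it, so the last few quality points are the most expensive in terms of output size.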
Is it worth it to convert PNG and JPEG images to WebP?
It is useful to estimate the storage needed if I were to make these changes to the resources upon retrieval:
- For PNG resources, recompress them using WebP with near-lossless compression and a quality value of 90.
- For JPEG resources, recompress them using WebP with lossy compression and a quality value of 85.
MIME Type | Estimated Population (Gigabytes) | After Recompression using WebP |
---|---|---|
image/jpeg | 633.14 | 227.42 |
image/gif | 187.44 | 187.44 |
image/png | 165.17 | 83.69 |
image/webp | 61.04 | 61.04 |
Total | 1,046.79 | 559.59 |
The projected monthly storage cost for images is about $12.87, a savings of about 47%. The transfer-out cost would be $50.36.
Yes, I think it would be worth the effort to recompress every PNG or JPEG image using WebP.
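The recompression estimate can be reproduced with the sketch below. The names are my own. The JPEG ratio is the lossy q=85 figure; for PNG I use 0.5067, the near-lossless q=90 figure, because that is the value that reproduces the 83.69 GB in the table (the lossless ratio of 0.6014 would give roughly 99 GB instead). GIF and WebP files are assumed to be stored unchanged. The sketch also reports what fraction of the original size remains and what fraction is saved.

```python
population_gb = {
    "image/jpeg": 633.14,
    "image/gif": 187.44,
    "image/png": 165.17,
    "image/webp": 61.04,
}

# Output size / input size. 1.0 means the type is stored unchanged.
ratio = {
    "image/jpeg": 0.3592,  # WebP lossy, q=85
    "image/gif": 1.0,
    "image/png": 0.5067,   # value that reproduces the table's 83.69 GB
    "image/webp": 1.0,
}

recompressed_gb = {m: gb * ratio[m] for m, gb in population_gb.items()}
before = sum(population_gb.values())
after = sum(recompressed_gb.values())
remaining = after / before
savings = 1 - remaining

monthly = after * 0.023    # USD per month at $0.023 per GB
transfer = after * 0.09    # USD to transfer everything out once

print(f"Before: {before:,.2f} GB, after: {after:,.2f} GB")
print(f"Remaining: {remaining:.1%}, saved: {savings:.1%}")
print(f"Monthly storage: ${monthly:.2f}, transfer out: ${transfer:.2f}")
```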
Considering AVIF
AVIF is a new image format that is supposed to replace WebP. Chrome and Firefox support it in their latest versions. It offers impressive compression ratios. Since support for the image format is still emerging in browsers, I will wait a bit longer before considering it.
As a preview of the future, here is a test for JPEG images.
Compression Option | Compression Ratio |
---|---|
Lossy q=80 | 0.1427 |
The conversion from JPEG images to AVIF images took considerably longer than the conversion from JPEG to WebP. AVIF did produce an excellent compression ratio. It reduced 3.210 GB of JPEG images to 0.458 GB of AVIF images.