Images, Emails and Content Addressable Storage

I’ve recently been thinking about how best to archive images and other external assets from emails and webpages for later analysis and review. I’ve accumulated many HTML-based emails, and it occurred to me that if I don’t archive the images these emails reference, they may eventually become unretrievable. In other words, the links and images may suffer from link rot.

Examples of JPEG Images from ~5,000 emails

To avoid this, I will archive and store all of the referenced images and other assets (CSS, SVGs). The Assetgraph module is quite helpful for this task, but I had to teach it how to parse MIME-based email messages with inline attachments.
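As a rough illustration of the extraction step, here is a minimal Python sketch that pulls asset URLs out of the HTML part of a MIME message. It is not the Assetgraph-based approach described above. It uses only Python's standard library, and the names `AssetCollector` and `referenced_assets` are hypothetical.

```python
import email
from email import policy
from html.parser import HTMLParser


class AssetCollector(HTMLParser):
    """Collects URLs of images and stylesheets referenced by an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "img" and attr.get("src"):
            self.urls.append(attr["src"])
        elif (
            tag == "link"
            and (attr.get("rel") or "").lower() == "stylesheet"
            and attr.get("href")
        ):
            self.urls.append(attr["href"])


def referenced_assets(raw_message: str) -> list[str]:
    """Return asset URLs found in every text/html part of a MIME message."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    collector = AssetCollector()
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            collector.feed(part.get_content())
    collector.close()
    return collector.urls
```

Inline attachments show up in the result as `cid:` URLs, so a fuller version would resolve those against the message's own MIME parts instead of fetching them over the network.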

Before processing every email, I’d like to use a sample to estimate the space required to store all of the referenced assets (images, CSS, SVGs).

What is Content-Addressable Storage?

An asset may be referenced more than once in the same email or used in more than one email, and it would be wasteful to store the same content more than once. Storing the assets in a content-addressable way ensures that identical content is stored only once. Being content-addressable means the content itself dictates the location, or address, used for storage. For example, consider a JPEG image. Traditionally, the storage filename would come from applying the MD5 hash function to the URL. With content-addressable storage, the storage location is instead determined by the output of MD5 (or another hash function) applied to the content retrieved from the URL. The content stored at a given address never changes. The trade-off is that content, once added, can never be safely removed, since there is no reference tracking to know which content is no longer referenced.
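The idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: it uses SHA-256 rather than MD5, stores files on the local filesystem, and the names `content_key` and `store` and the two-character directory fan-out are hypothetical choices.

```python
import hashlib
from pathlib import Path


def content_key(content: bytes) -> str:
    """Derive the storage address from the bytes themselves, not from the URL."""
    return hashlib.sha256(content).hexdigest()


def store(root: Path, content: bytes) -> Path:
    """Write content under its hash; identical content maps to the same path."""
    key = content_key(content)
    path = root / key[:2] / key[2:]  # fan out to avoid one huge directory
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return path
```

Storing the same image twice, even when it was fetched from two different URLs, returns the same path and writes the file only once.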

What types of URLs are referenced in emails?

Storage By File Type

Image-related MIME types are the most commonly referenced. This chart compares the largest image MIME types in the sample:

Aggregate Size By File Extension

Histograms of File Size by File Type

Size By File Extension

It is apparent that most files are small, but there is a long tail of much larger files. I truncated the x-axis on these charts to better highlight the distribution.

Extrapolating to the Entire Population

The sample is just 5,071 emails. It is useful to extrapolate to the entire population of 1,000,000 emails to estimate the storage costs. Each sample figure is scaled by 1,000,000 / 5,071, or roughly 197.

| Content Type | Sample (GB) | Estimated Population (GB) |
|---|---|---|
| image/jpeg | 3.21 | 633.14 |
| image/gif | 0.95 | 187.44 |
| image/png | 0.84 | 165.17 |
| image/webp | 0.31 | 61.04 |
| Total | 5.31 | 1,046.79 |
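The extrapolation is a simple linear scaling, sketched below. It assumes the sample is representative of the full population. Because the sample column is rounded to two decimals, the results differ from the table by a fraction of a gigabyte. The function name `extrapolate` is hypothetical.

```python
SAMPLE_EMAILS = 5_071
POPULATION_EMAILS = 1_000_000

# Sample sizes in gigabytes, copied from the table (rounded to two decimals).
sample_gb = {
    "image/jpeg": 3.21,
    "image/gif": 0.95,
    "image/png": 0.84,
    "image/webp": 0.31,
}


def extrapolate(sample: dict[str, float], sample_n: int, population_n: int) -> dict[str, float]:
    """Scale each per-type sample size by population_n / sample_n."""
    scale = population_n / sample_n
    return {mime: size * scale for mime, size in sample.items()}


estimate = extrapolate(sample_gb, SAMPLE_EMAILS, POPULATION_EMAILS)
for mime, size in estimate.items():
    print(f"{mime:12s} {size:8.2f} GB")
print(f"{'Total':12s} {sum(estimate.values()):8.2f} GB")
```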

Storing a gigabyte in AWS S3 costs ~$0.023 per month in the USA.

The monthly cost to store all of the resources is: $24.07. So far so good.

AWS charges $0.09 per GB to transfer data out, so to retrieve all of the resources it would cost $94.21.

The per-request charges to S3 will also add up. With each email referencing about 12 resources, storing 1,000,000 emails will result in 12,000,000 calls to S3’s PutObject method. Since PutObject requests cost $0.005 per 1,000 API calls, it would cost $60.00 to make all of them.
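The three cost figures above follow from straightforward arithmetic, sketched here. The prices are the ones quoted in the text. The constant and function names are hypothetical, and actual AWS pricing varies by region and over time.

```python
STORAGE_USD_PER_GB_MONTH = 0.023  # storage price quoted above
TRANSFER_OUT_USD_PER_GB = 0.09    # data transfer out price quoted above
PUT_USD_PER_1000_REQUESTS = 0.005  # PutObject price quoted above


def monthly_storage_cost(total_gb: float) -> float:
    return total_gb * STORAGE_USD_PER_GB_MONTH


def transfer_out_cost(total_gb: float) -> float:
    return total_gb * TRANSFER_OUT_USD_PER_GB


def put_request_cost(emails: int, resources_per_email: int) -> float:
    requests = emails * resources_per_email
    return requests / 1000 * PUT_USD_PER_1000_REQUESTS


print(f"storage per month: ${monthly_storage_cost(1046.79):.2f}")
print(f"retrieve everything: ${transfer_out_cost(1046.79):.2f}")
print(f"PutObject requests: ${put_request_cost(1_000_000, 12):.2f}")
```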

Evaluating WebP

The WebP image format is increasingly available across platforms. It is exciting to see broad adoption after about a decade of waiting.

WebP offers both lossy and lossless compression algorithms; this means that it is a candidate to recompress both the JPEG and the PNG files, respectively.

PNG conversion to WebP

| Compression Option | Compression Ratio |
|---|---|
| Near Lossless q=80 | 0.5066 |
| Near Lossless q=90 | 0.5067 |
| Lossless | 0.6014 |

JPEG conversion to WebP

| Compression Option | Compression Ratio |
|---|---|
| Lossy q=75 | 0.2451 |
| Lossy q=80 | 0.2955 |
| Lossy q=85 | 0.3592 |
| Lossy q=90 | 0.4607 |
| Lossy q=95 | 0.6310 |
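The ratios in these tables are output size divided by input size, so smaller is better. The article does not say which encoder produced them. As one possible way to run a similar experiment, the sketch below assumes the `cwebp` command-line tool from libwebp is installed, and that the "near lossless" rows correspond to its `-near_lossless` option. The helper names are hypothetical.

```python
import os
import subprocess


def cwebp_command(src: str, dst: str, mode: str, quality: int = 75) -> list[str]:
    """Build a cwebp invocation for one of three modes."""
    if mode == "lossy":
        return ["cwebp", "-q", str(quality), src, "-o", dst]
    if mode == "lossless":
        return ["cwebp", "-lossless", src, "-o", dst]
    if mode == "near_lossless":
        # cwebp switches to lossless mode when -near_lossless is given.
        return ["cwebp", "-near_lossless", str(quality), src, "-o", dst]
    raise ValueError(f"unknown mode: {mode}")


def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Output size as a fraction of input size."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return compressed_bytes / original_bytes


def convert(src: str, dst: str, mode: str, quality: int = 75) -> float:
    """Run cwebp and return the resulting compression ratio."""
    subprocess.run(cwebp_command(src, dst, mode, quality), check=True)
    return compression_ratio(os.path.getsize(src), os.path.getsize(dst))
```

To get an aggregate ratio like the ones in the tables, sum the input and output sizes across the whole sample before dividing. Averaging per-file ratios would give a different number.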

What could explain the large size reductions for the JPEG files? It seems that JPEGs in the sample often include large embedded color profile blocks. Publishers of the JPEG files may be unaware of how large the images are; they may assume that downscaling and optimization occur further down the pipeline.
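One way to test the color-profile hypothesis is to measure how many bytes of each JPEG are taken up by an embedded ICC profile. The sketch below walks the JPEG header segments and sums the APP2 segments tagged `ICC_PROFILE`. It is a simplified parser written for illustration, it does not handle every JPEG variant, and the function name is hypothetical.

```python
import struct

ICC_TAG = b"ICC_PROFILE\x00"


def icc_profile_bytes(jpeg: bytes) -> int:
    """Return the number of ICC profile bytes embedded in a JPEG's header."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    total = 0
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            break  # lost sync with the segment structure
        marker = jpeg[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # start of scan or end of image
            break
        # The 2-byte big-endian length includes the length field itself.
        (length,) = struct.unpack(">H", jpeg[pos + 2 : pos + 4])
        payload = jpeg[pos + 4 : pos + 2 + length]
        if marker == 0xE2 and payload.startswith(ICC_TAG):
            # Skip the tag and the 2-byte chunk sequence/count header.
            total += max(len(payload) - len(ICC_TAG) - 2, 0)
        pos += 2 + length
    return total
```

Comparing this number with the total file size across the sample would show how much of the JPEG storage is profile data rather than image data.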

Comparison of WebP Quality vs Compression Size Ratio

There appears to be a non-linear relationship between the quality selected for compression and the output size produced.
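The non-linearity is visible directly in the JPEG-to-WebP table: each 5-point step in quality adds more to the ratio than the previous one. A short sketch using the values from the table:

```python
# Lossy WebP ratios from the JPEG conversion table, keyed by quality.
ratios = {75: 0.2451, 80: 0.2955, 85: 0.3592, 90: 0.4607, 95: 0.6310}

qualities = sorted(ratios)
# Increase in compression ratio for each step up in quality.
increments = [
    round(ratios[hi] - ratios[lo], 4)
    for lo, hi in zip(qualities, qualities[1:])
]

for (lo, hi), inc in zip(zip(qualities, qualities[1:]), increments):
    print(f"q={lo} -> q={hi}: +{inc:.4f}")
```

The increments grow at every step, so the last few points of quality cost far more in output size than the first few.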

Is it worth it to convert PNG and JPEG images to WebP?

It is useful to estimate the storage needed if I were to make these changes to the resources upon retrieval:

  • For PNG resources, recompress them using WebP with near-lossless compression (q=90, ratio 0.5067).
  • For JPEG resources, recompress them using WebP with lossy compression and a quality value of 85.

| MIME Type | Estimated Population (GB) | After Recompression using WebP (GB) |
|---|---|---|
| image/jpeg | 633.14 | 227.42 |
| image/gif | 187.44 | 187.44 |
| image/png | 165.17 | 83.69 |
| image/webp | 61.04 | 61.04 |
| Total | 1,046.79 | 559.59 |

The monthly projected storage cost for images is about $12.87, which is 53% of the original cost, or a savings of about 47%. The transfer-out cost would be $50.36.
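The projection can be reproduced by applying a ratio to each MIME type and leaving GIF and WebP untouched. In the sketch below, the JPEG ratio is the lossy q=85 value (0.3592). The PNG figure in the table (83.69 GB) matches the near-lossless q=90 ratio of 0.5067, so that is what is used here. The lossless ratio of 0.6014 would give about 99.33 GB instead. Variable names are hypothetical.

```python
population_gb = {
    "image/jpeg": 633.14,
    "image/gif": 187.44,
    "image/png": 165.17,
    "image/webp": 61.04,
}

# Ratios taken from the WebP tables; types not listed are left as-is.
webp_ratio = {"image/jpeg": 0.3592, "image/png": 0.5067}

after_gb = {
    mime: size * webp_ratio.get(mime, 1.0)
    for mime, size in population_gb.items()
}

total_before = sum(population_gb.values())
total_after = sum(after_gb.values())
remaining = total_after / total_before  # fraction of the original size kept
reduction = 1 - remaining               # fraction saved

print(f"after recompression: {total_after:.2f} GB")
print(f"storage per month: ${total_after * 0.023:.2f}")
print(f"transfer out: ${total_after * 0.09:.2f}")
print(f"remaining: {remaining:.1%}, saved: {reduction:.1%}")
```

The new total is about 53% of the original size, which corresponds to a reduction of roughly 47%.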

Yes, I think it would be worth the effort to recompress every PNG or JPEG image using WebP.

Considering AVIF

AVIF is a new image format that is supposed to replace WebP. Chrome and Firefox support it in their latest versions. It offers impressive compression ratios. Since support for the image format is still emerging in browsers, I will wait a bit longer before considering it.

As a preview of the future, here is a test for JPEG images.

| Compression Option | Compression Ratio |
|---|---|
| Lossy q=80 | 0.1427 |

The conversion from JPEG images to AVIF images took considerably longer than the conversion from JPEG to WebP. AVIF did produce an excellent compression ratio. It reduced 3.210 GB of JPEG images to 0.458 GB of AVIF images.