Some notes on another way to generate one image per link for my linkblog: unfurls. Those are the site previews that products like Twitter Cards, Slack unfurls, or Facebook link previews generate. They try to extract a few words and an image from the page; the resulting preview is structured data, not just a picture. Most of these work by looking for oEmbed data, OpenGraph tags, or Twitter's card tags, then guessing if that metadata isn't present.
Most unfurls are a mix of text, a little structured data, and an image or video embed. I’m focused on the image. Possibly I could generate an image from textual data too.
Page metadata: OpenGraph and friends
Facebook, Twitter, Slack, etc. rely primarily on page metadata to get a text summary and example image. This only works for sites that publish the metadata, but it's so commonly used now that a lot of sites have it. These formats all offer one image. By custom that image tends to be 400+ pixels wide and roughly 2:1 aspect ratio.
OpenGraph is the big standard, originally from Facebook; here's a useful description of how it works in practice. It dates to 2010. For my purposes the key tag is og:image; og:title gives text.
Twitter's cards are a popular expansion of the OpenGraph idea dating to 2012. More details on supported tags here. twitter:image is the most relevant, and possibly twitter:player. There are more tags for textual metadata than OpenGraph has.
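For concreteness, here's a minimal sketch of pulling OpenGraph and Twitter Card tags out of a page with Python's standard-library html.parser. A real scraper would also need to handle character encodings, relative URLs, and badly broken HTML:

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta property="og:*"> and <meta name="twitter:*"> tags."""
    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # OpenGraph uses the property attribute; Twitter Cards use name
        key = attrs.get("property") or attrs.get("name") or ""
        if key.startswith(("og:", "twitter:")):
            self.tags[key] = attrs.get("content", "")

def extract_card_tags(html: str) -> dict:
    parser = MetaTagParser()
    parser.feed(html)
    return parser.tags

html = """<html><head>
<meta property="og:title" content="An Example Page">
<meta property="og:image" content="https://example.com/preview.png">
<meta name="twitter:card" content="summary_large_image">
</head><body></body></html>"""

print(extract_card_tags(html)["og:image"])
# https://example.com/preview.png
```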
oEmbed is the oldest standard, from 2008. The metadata isn't in the page itself; instead you have to make a query to a special JSON endpoint. The interesting image fields are thumbnail_url and maybe the video types. Lots of textual metadata too. I don't know how popular oEmbed is these days, but Slack supports it.
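The oEmbed flow is a two-step dance: discover the endpoint (it's advertised in the page via a <link rel="alternate" type="application/json+oembed"> tag), then query it. A sketch of the query step, with a hypothetical endpoint URL:

```python
from urllib.parse import urlencode

def oembed_request_url(endpoint: str, target_url: str) -> str:
    """Build the JSON query described by the oEmbed spec."""
    query = urlencode({"url": target_url, "format": "json"})
    return f"{endpoint}?{query}"

# A real client would then fetch this URL and read thumbnail_url
# from the returned JSON:
#   data = json.load(urllib.request.urlopen(request_url))
#   image = data.get("thumbnail_url")
request_url = oembed_request_url(
    "https://example.com/oembed",        # hypothetical endpoint
    "https://example.com/posts/123")
print(request_url)
```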
schema.org Articles are also relevant, particularly since Google supports them. It started in 2011 as a way to tell search engines how to summarize a page. The metadata is in the page in one of three encodings (sigh). There's also an overwhelming number of tags; maybe thumbnailUrl is what I want? Or image? Honestly this is all so complicated I'm not in a hurry to learn more.
Metadata is great if it's present, but what do you do if it's not? Maybe pull an image from the page? This approach doesn't seem to be popular, but Feedly does it; there's a description in item 4 of what they do. It boils down to images tagged with the webfeedsFeaturedVisual class, or else the first big image in the story, or else the biggest image on the page.
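A rough sketch of that Feedly-style fallback: prefer an image tagged webfeedsFeaturedVisual, otherwise take the largest img by declared width × height. Real pages often omit width/height attributes, so a robust version would fetch the candidate images and measure them:

```python
from html.parser import HTMLParser

class ImageFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.featured = None   # src of an explicitly tagged image
        self.biggest = None    # (area, src) of the largest image so far

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src = attrs.get("src")
        if not src:
            return
        if "webfeedsFeaturedVisual" in (attrs.get("class") or ""):
            self.featured = src
        try:
            area = int(attrs.get("width", 0)) * int(attrs.get("height", 0))
        except ValueError:     # e.g. width="100%"
            area = 0
        if self.biggest is None or area > self.biggest[0]:
            self.biggest = (area, src)

def pick_image(html: str):
    finder = ImageFinder()
    finder.feed(html)
    if finder.featured:
        return finder.featured
    return finder.biggest[1] if finder.biggest else None

html = """<body>
<img src="icon.png" width="32" height="32">
<img src="hero.jpg" width="800" height="400">
</body>"""
print(pick_image(html))  # hero.jpg
```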
Another option would be to take a screenshot of the page itself. I have some notes on that but it’s not very simple to get a good result.
A third option would be to use the favicon. Not awesome, but it is there.
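The favicon fallback is at least simple: look for a <link rel="icon"> in the page, or fall back to the conventional /favicon.ico path. A sketch:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class IconFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href:
            return
        attrs = dict(attrs)
        # rel can be multi-valued, e.g. "shortcut icon"
        rels = (attrs.get("rel") or "").lower().split()
        if "icon" in rels:
            self.href = attrs.get("href")

def favicon_url(page_url: str, html: str) -> str:
    finder = IconFinder()
    finder.feed(html)
    return urljoin(page_url, finder.href or "/favicon.ico")

print(favicon_url("https://example.com/post", "<head></head>"))
# https://example.com/favicon.ico
```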
Heuristics for combining preview data
So now we have myriad ways to find an image for the site; which one do we use? Slack has a great post (from my friend Matt!) about how they generate unfurls. They give oEmbed priority, then Twitter+OpenGraph, then HTML meta tags as a fallback. They also seem to combine data from multiple sources; their cards have room for a lot of text.
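Slack's priority order can be sketched as a simple merge: take the first image any source yields, checking oEmbed first, then the Twitter/OpenGraph card tags, then plain HTML meta tags. The field names in the fallback chain here are my assumptions about what each source would produce:

```python
def pick_preview_image(oembed: dict, card_tags: dict, html_meta: dict):
    """Return the first available image URL, in Slack's priority order."""
    candidates = [
        oembed.get("thumbnail_url"),     # oEmbed wins if present
        card_tags.get("twitter:image"),  # then Twitter Cards
        card_tags.get("og:image"),       # then OpenGraph
        html_meta.get("image"),          # hypothetical HTML-meta fallback
    ]
    return next((url for url in candidates if url), None)

print(pick_preview_image(
    {},                                          # no oEmbed data
    {"og:image": "https://example.com/og.png"},  # OpenGraph only
    {}))
# https://example.com/og.png
```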
Tyler Young from Felt recently tweeted about their solution. Light on details though, mostly it’s just “it’s complicated”. That echoes what I’ve heard informally from folks who’ve worked on this problem at various companies. Lots of one-off hacks in the end.
All this stuff is complicated and of general interest; is there a reusable library for generating previews I can just use? npm.io has a big list; here are some highlights from it, plus some others.
metascraper (GitHub) looks to be the most active of the NPMs. MIT license, active development. Has a lot of configurable options and also custom code for popular sites. This looks like a strong contender I should evaluate further.
unfurl (GitHub). TypeScript, active development, MIT license.
pyUnfurl (GitHub) does the various metadata things and falls back to favicon. Python, MIT license, a small project whose last main development was 3 years ago.
extruct is in Python (BSD license) and recently updated. OpenGraph but not Twitter (yet, there’s a pull request).
webpreview is in Python (MIT license) and was last updated two years ago. OpenGraph, Twitter, Schema, or else “from the webpage’s content”.
Metaphor is the Python code with the most Google juice but hasn’t been updated in 5 years.
Buying a service
Update: Ryan B mentioned using linkpreview.net, it looks promising. It has a JSON API that returns data including an image. The free plan might work for me or paid plans start at $8/mo.
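The usual shape for an API like this is a GET request with the target URL and an API key, returning JSON that includes an image field. A sketch of calling such a service — the parameter and field names here are my assumptions, not confirmed against linkpreview.net's docs:

```python
from urllib.parse import urlencode

def preview_request(api_key: str, target_url: str) -> str:
    """Build a preview-API request URL (assumed parameter names)."""
    query = urlencode({"key": api_key, "q": target_url})
    return f"https://api.linkpreview.net/?{query}"

# A real client would fetch this and read the image out of the JSON:
#   data = json.load(urllib.request.urlopen(url))
#   image = data.get("image")   # assumed field name
url = preview_request("MY_API_KEY", "https://example.com/article")
print(url)
```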
It'd also be possible for me to do a bit of guerrilla scraping for my own "service": post the link to Feedly or Twitter, then capture whatever preview they come up with. That might actually work pretty well, right up until it doesn't.
I should start with Metascraper. It looks good; my only complaint is dealing with Node. Otherwise I think I'm stuck writing my own thing: a lightweight OpenGraph / Twitter card scraper plus a reimplementation of some image-choosing heuristics. That's a lot of work I'd rather not do.
Once I have an unfurl tool I can use, I should test it on my linkblog. I'm really curious how many of the links I've posted have useful metadata; I'd guess about half of those from the last 3 years. Seems worth doing a survey.
Screenshots may still prove to be a useful fallback.