Some notes on another way to generate one image per link for my linkblog: unfurls. Those are the site previews you see in things like Twitter Cards, Slack Unfurls, or Facebook Link Previews. They try to extract a few words and an image from the page; the resulting preview is structured data, not just a picture. Most of these basically work by looking for oEmbed tags, OpenGraph tags, or Twitter’s card tags and then just guessing if those metadata tags aren’t present.
Most unfurls are a mix of text, a little structured data, and an image or video embed. I’m focused on the image. Possibly I could generate an image from textual data too.
Page metadata: OpenGraph and friends
Facebook, Twitter, Slack, etc. rely primarily on page metadata to get a text summary and example image. This only works for sites that publish the metadata, but it’s so commonly used now that a lot of sites have it. These formats all offer one image. By custom that image tends to be 400+ pixels wide and roughly 2:1 aspect ratio.
OpenGraph is the big standard, originally from Facebook, here’s a useful description of how it works in practice. Dates to 2010. For my purposes the key tag is og:image. og:title gives text.
Twitter’s cards are a popular expansion of the OpenGraph idea dating to 2012. More details on supported tags here. twitter:image is the most relevant, and possibly twitter:player. There are more tags for textual metadata than in OpenGraph.
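A minimal sketch of reading these tags in Python, using only the standard library. The names MetaTagParser, extract_meta_tags, and pick_image are made up for illustration, and the preference for og:image over twitter:image is an assumption, not something any of the standards require.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect og:* and twitter:* <meta> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # OpenGraph conventionally uses property=, Twitter cards use name=.
        # Sites mix the two up, so accept either attribute.
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content and key.startswith(("og:", "twitter:")):
            # Keep the first occurrence of each tag.
            self.tags.setdefault(key, content)


def extract_meta_tags(html):
    """Return a dict like {"og:image": "...", "twitter:image": "..."}."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.tags


def pick_image(tags):
    """Assumed preference order: og:image first, then twitter:image."""
    for key in ("og:image", "twitter:image"):
        if key in tags:
            return tags[key]
    return None
```

Fetching the page is left out on purpose; pass in whatever HTML string you already have.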
oEmbed is the oldest standard, from 2008. Metadata isn’t in the page itself, instead you have to make a query to a special JSON endpoint. Interesting image fields are thumbnail_url and maybe the photo and video types. Lots of textual metadata too. I don’t know how popular oEmbed is these days but Slack supports it.
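A rough sketch of the oEmbed flow, assuming the page advertises its endpoint with a link tag of type application/json+oembed (the discovery mechanism in the oEmbed spec; not every provider does this). The function names are hypothetical, and treating the url field of a photo response as the best image is a guess at what's useful here.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


class OEmbedLinkParser(HTMLParser):
    """Find the first <link type="application/json+oembed" href="..."> tag."""

    def __init__(self):
        super().__init__()
        self.endpoint = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if (tag == "link" and self.endpoint is None
                and attr.get("type") == "application/json+oembed"
                and attr.get("href")):
            self.endpoint = attr["href"]


def find_oembed_endpoint(html):
    """Return the advertised oEmbed JSON endpoint, or None."""
    parser = OEmbedLinkParser()
    parser.feed(html)
    return parser.endpoint


def image_from_oembed(data):
    """Pick an image URL from a decoded oEmbed response (a dict)."""
    # For type "photo" the url field is the image itself.
    if data.get("type") == "photo" and data.get("url"):
        return data["url"]
    # Other types (video, rich, link) may still carry a thumbnail.
    return data.get("thumbnail_url")


def fetch_oembed(endpoint):
    """Query the endpoint and decode the JSON response (needs network)."""
    with urlopen(endpoint, timeout=10) as resp:
        return json.load(resp)
```

Usage would be something like image_from_oembed(fetch_oembed(find_oembed_endpoint(html))), with a None check after the discovery step.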
schema.org Articles also are relevant, particularly since Google supports them. It started in 2011 as a way to tell search engines how to summarize a page. Metadata is in the page in one of three encodings (sigh). There’s also an overwhelming number of tags; maybe thumbnailUrl is what I want? Or image? Honestly this is all so complicated I’m not in a hurry to learn more.
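For what it's worth, here is a sketch that handles just one of the encodings, JSON-LD (script tags of type application/ld+json), and looks for image or thumbnailUrl on the top-level items. Everything about it is an assumption: the function names are invented, Microdata and RDFa are ignored, and nested structures like @graph are not followed.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the decoded contents of <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks rather than failing.


def _as_url(value):
    """An image value may be a string, an object with a url, or a list."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        return _as_url(value.get("url"))
    if isinstance(value, list):
        for item in value:
            url = _as_url(item)
            if url:
                return url
    return None


def image_from_jsonld(html):
    """Return the first image or thumbnailUrl found in top-level items."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    for doc in parser.documents:
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if not isinstance(item, dict):
                continue
            for key in ("image", "thumbnailUrl"):
                url = _as_url(item.get(key))
                if url:
                    return url
    return None
```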
Without metadata
Metadata is great if it’s present, but what do you do if it’s not? Maybe pull an image from the page? This approach doesn’t seem so popular but Feedly does it. There’s a description in item 4 of what they do. Boils down to images tagged webfeedsFeaturedVisual, or else the first big image in the story, or else the biggest image on the page.
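That three-step fallback could be sketched like this. It is not Feedly's code, just a guess at the shape of it: it assumes webfeedsFeaturedVisual is a class on the img tag, takes sizes from the width and height attributes instead of downloading anything, treats the whole document as "the story", and uses a made-up 400 pixel threshold for "big".

```python
from html.parser import HTMLParser

MIN_BIG_WIDTH = 400  # Assumed threshold for a "big" image, in pixels.


def _to_int(value):
    """Parse a width/height attribute, treating junk or missing as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


class ImgParser(HTMLParser):
    """Collect src, classes, and declared size for every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        if not attr.get("src"):
            return
        self.images.append({
            "src": attr["src"],
            "classes": (attr.get("class") or "").split(),
            "width": _to_int(attr.get("width")),
            "height": _to_int(attr.get("height")),
        })


def choose_image(html):
    parser = ImgParser()
    parser.feed(html)
    images = parser.images
    # 1. An image explicitly tagged as the featured visual.
    for img in images:
        if "webfeedsFeaturedVisual" in img["classes"]:
            return img["src"]
    # 2. The first image that counts as big.
    for img in images:
        if img["width"] >= MIN_BIG_WIDTH:
            return img["src"]
    # 3. Otherwise the biggest image by declared area.
    if images:
        return max(images, key=lambda i: i["width"] * i["height"])["src"]
    return None
```

Plenty of real pages omit width and height attributes, so a serious version would have to fetch the images to measure them.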
Another option would be to take a screenshot of the page itself. I have some notes on that but it’s not very simple to get a good result.
A third option would be to use the favicon. Not awesome, but it is there.
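Finding the favicon is at least simple. A small sketch, with hypothetical names: look for a link tag whose rel includes icon, and otherwise fall back to the conventional /favicon.ico at the site root.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Find the first <link rel="icon"> (or rel="shortcut icon") href."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if (tag == "link" and self.href is None
                and "icon" in rel and attr.get("href")):
            self.href = attr["href"]


def favicon_url(page_url, html):
    """Return an absolute favicon URL for the page."""
    parser = IconLinkParser()
    parser.feed(html)
    # Resolve relative hrefs against the page; default to /favicon.ico.
    return urljoin(page_url, parser.href or "/favicon.ico")
```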
Heuristics for combining preview data
So now we have myriad ways to find an image for the site; which one do we use? Slack has a great post (from my friend Matt!) about how they generate unfurls. They give oEmbed priority, then Twitter+OpenGraph, then HTML meta tags as a fallback. They also seem to combine data from multiple sources; their cards have room for a lot of text.
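The priority-plus-merging idea can be sketched as a field-by-field merge. This is only an illustration of the ordering described above, not Slack's implementation; the source names, the field names, and putting Twitter ahead of OpenGraph within their shared tier are all assumptions.

```python
# Highest priority first. The relative order of "twitter" and
# "opengraph" is an assumption; the post groups them together.
SOURCE_PRIORITY = ("oembed", "twitter", "opengraph", "html_meta")


def combine_previews(sources):
    """Merge per-source dicts, letting higher-priority sources win.

    `sources` maps a source name to a dict of fields such as
    {"title": ..., "description": ..., "image": ...}. Empty or
    missing values never block a lower-priority source.
    """
    preview = {}
    for name in SOURCE_PRIORITY:
        for field, value in sources.get(name, {}).items():
            if value and field not in preview:
                preview[field] = value
    return preview
```

So if oEmbed supplies a title but no image, the title comes from oEmbed and the image from whichever lower-priority source has one.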
Tyler Young from Felt recently tweeted about their solution. Light on details though, mostly it’s just “it’s complicated”. That echoes what I’ve heard informally from folks who’ve worked on this problem at various companies. Lots of one-off hacks in the end.
Code Tools
All this stuff is complicated and of general interest; is there a reusable library for generating previews I can just use? npm.io has a big list; here are some highlights from it and some others.
metascraper (GitHub) looks to be the most active of the NPMs. MIT license, active development. Has a lot of configurable options and also custom code for popular sites. This looks like a strong contender I should evaluate further.
iFramely (GitHub) is mostly a hosted service but their parser is on GitHub with an MIT License. JavaScript, looks like fairly active development. Also promising.
unfurl (GitHub). TypeScript, active development, MIT license.
Link Preview (GitHub) generates OpenGraph, TwitterCard, and oEmbed previews of pages. The source is JavaScript and MIT licensed. Last updates about 2 years ago. Looks like a small project.
pyUnfurl (GitHub) does the various metadata things and falls back to favicon. Python, MIT License, small project with last main development 3 years ago.
extruct is in Python (BSD license) and recently updated. OpenGraph but not Twitter (yet, there’s a pull request).
webpreview is in Python (MIT license) and was last updated two years ago. OpenGraph, Twitter, Schema, or else “from the webpage’s content”.
Metaphor is the Python code with the most Google juice but hasn’t been updated in 5 years.
Buying a service
Another option for unfurls is just buying a blackbox service for this. Embed.ly is the one I know; part of the value add is they have “700+ Official Content Providers” who they’ve worked to interoperate with. Their $9/mo product wants to embed “cards” though, which I take to mean JavaScript running on your site. Not for me. The $99/mo product is a lot more flexible but is more than I’d want to spend. iFramely also looks promising although similar problem with pricing / embeds.
Update: Ryan B mentioned using linkpreview.net, it looks promising. It has a JSON API that returns data including an image. The free plan might work for me or paid plans start at $8/mo.
It’d also be possible for me to do a bit of guerilla scraping for my own “service”. Post the link to Feedly or Twitter, then capture what preview they come up with. That might actually work pretty well, at least right until it doesn’t.
My Plan
I should start with Metascraper. It looks good, my only complaint is dealing with Node. Alternately I think I’m stuck writing my own thing, a lightweight OpenGraph / Twitter card scraper plus reimplementing some image choosing heuristics. That’s a lot of work I’d rather not do.
Once I have an unfurl tool I can use I should test it on my linkblog. I’m really curious how many of the links I’ve posted have useful metadata. I’d guess about half of them in the last 3 years. Seems worth doing a survey.
Screenshots may still prove to be a useful fallback.