Linkblog architecture

2023/07/15 ~ nelsonminar ~ 1 Comment

I’ve seen some folks talking about setting up a new linkblog so I thought this would be a good time to document how my linkblog works. I’ve run this publication for nearly 20 years and am pretty happy with the current iteration. See also previous detailed posts.

Summary: I post links to pinboard. A cron job on my own server downloads the pinboard data to sqlite and renders static HTML, RSS/Atom, and Mastodon posts from it. There’s also some code for generating screenshots of the linked pages.

Simpler alternative

There’s no need for custom code or your own server. You can run a simple linkblog just using Pinboard or Raindrop. Those sites offer both an HTML view and a very functional RSS feed. There are a bunch of services that will mirror RSS to social media like Mastodon: IFTTT, maybe Zapier. That’s how I ran my linkblog for many years.

My data model

My linkblog is a collection of posts. Each post has six elements:

URL of the page I’m linking
Short description
Extended description
Tags (including a +/- sentiment tag)
Date I made the post
Hash of the post

This is basically Pinboard’s schema, not an accident!

One thing I don’t have for each of my posts is a URL for my post itself. There’s no way to link to one linkblog post of mine. My RSS feed’s links don’t go to my own linkblog; they go directly to the page I’m linking. I think this is the defining feature of what a linkblog is but not everyone who linkblogs agrees. I have some detailed notes on what other linkblogs do.

The hash is a little strange; it’s something Pinboard generates, a 32 character string. I use it in places where I need something unique for each link I blogged like the ID for each Atom feed entry or the URL for an image preview.
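Here’s a minimal sketch of that data model as a Python dataclass. The field names, the literal "+" and "-" spelling of the sentiment tag, and the urn-style Atom ID are assumptions for illustration, not my actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Post:
    """One linkblog post: the six elements of the data model."""
    url: str                  # page being linked
    description: str          # short description
    extended: str             # extended description
    tags: Tuple[str, ...]     # includes the sentiment tag
    posted: datetime          # when the post was made
    hash: str                 # 32 character string generated by Pinboard

    def sentiment(self) -> Optional[str]:
        """Return "+" or "-" if a sentiment tag is present (assumed spelling)."""
        for tag in self.tags:
            if tag in ("+", "-"):
                return tag
        return None

    def atom_id(self) -> str:
        """Hypothetical unique Atom entry ID built from the hash."""
        return f"urn:linkblog:{self.hash}"


post = Post(
    url="https://example.com/",
    description="Example link",
    extended="",
    tags=("+", "web"),
    posted=datetime(2023, 7, 15, tzinfo=timezone.utc),
    hash="0" * 32,
)
```

Note there is deliberately no field for a URL of the post itself; the hash is the only per-post identifier.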

Posting a new link

When I want to post a new link I use the Pinboard Plus extension in my browser. Really any way to create a Pinboard post would work, this one is just convenient for me. The posting UI looks like this:

Publishing new posts

I have a cron job on a personal server that polls Pinboard every 5 minutes checking for a new post. When it finds one it does a series of things:

Add the post to a local sqlite mirror of Pinboard data
Create image previews for the new link
Render static HTML and Atom feeds for the whole linkblog
Push those static files to a web host
Post new links to Mastodon
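The first step, mirroring posts into sqlite and finding out which ones are new, could be sketched like this. The links table and its ts column come from my real database; the other column names and the mirror_posts helper are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
create table if not exists links (
    hash text primary key,   -- Pinboard's 32 character ID
    url text not null,
    description text,
    extended text,
    tags text,               -- space-separated tags
    ts integer not null      -- unix epoch seconds
)
"""


def mirror_posts(db, posts):
    """Insert posts that are not mirrored yet; return the newly added ones."""
    db.execute(SCHEMA)
    new = []
    for p in posts:
        seen = db.execute(
            "select 1 from links where hash = ?", (p["hash"],)
        ).fetchone()
        if seen:
            continue
        db.execute(
            "insert into links (hash, url, description, extended, tags, ts) "
            "values (:hash, :url, :description, :extended, :tags, :ts)",
            p,
        )
        new.append(p)
    db.commit()
    return new


db = sqlite3.connect(":memory:")
sample = [{
    "hash": "a" * 32,
    "url": "https://example.com/",
    "description": "Example link",
    "extended": "",
    "tags": "+ web",
    "ts": 1689379200,
}]
first_run = mirror_posts(db, sample)    # the post is new
second_run = mirror_posts(db, sample)   # nothing new the second time
```

Only the posts returned as new would need image previews; the rendering and publishing steps can then run over the whole table.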

Image previews

A side / optional feature, an image for each link I post. It boils down to using Metascraper to look for a suitable image unfurl or else shot-scraper for a screenshot. Here’s some detailed notes on page previews.

HTML and Atom rendering

Most of the code I wrote is for rendering my linkblog to HTML and Atom feeds. It’s pretty straightforward though. It should be possible to use a static blog generator like Hugo instead but I wanted complete control so I rolled my own. I don’t worry too much about the HTML page functionality; I expect most people read my linkblog via a feed reader or else social media.

Social Media posting

New posts go to social media; these days that’s Mastodon (it used to be Twitter). This process is driven entirely from the Atom feed, not from my rendering code or sqlite mirror. Currently I’m running feed2toot on my server to do the posts and it works fine. A service like Zapier or IFTTT would work too and be one less thing I have to manage myself.

Thoughts

I set up this new architecture last year and have been very happy with how robust it is. Cron jobs and static publication are a great way to build things like this. I do regret running my own server for the rendering piece but I wanted to experiment with something custom to me, so it was effort well spent. Fortunately it seems to be running by itself with little work for the last year.

I think an existing bookmarking service could run a whole linkblog. I haven’t tried it but Raindrop.io looks to have all the important components: image previews, feeds, and a nice public page for an HTML view. It’d come down to details and whether they center links to raindrop.io itself or the destination site that was linked.

Linkblog stats

2023/01/06 (updated 2023/07/15) ~ nelsonminar

I got an archive view of my linkblog working; 20,000+ links over 19 years!

Here’s a dump of links per year. I’ve been pretty consistent.

2003 477
2004 1487
2005 1218
2006 912
2007 1011
2008 665
2009 757
2010 815
2011 1463
2012 1182
2013 1010
2014 1168
2015 1007
2016 979
2017 1453
2018 1311
2019 915
2020 1133
2021 809
2022 927
2023 17

To generate some rollup stats in my database:

create temporary table stats as
  select year, month, count(year) as count
  from (select strftime('%Y', ts, 'unixepoch') as year,
               strftime('%m', ts, 'unixepoch') as month
        from links)
  group by year, month;
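A minimal sketch of running that rollup from Python with the standard sqlite3 module. The three timestamps are made-up sample data, and the per-year query at the end is just one way to get the yearly totals from the stats table.

```python
import sqlite3

ROLLUP = """
create temporary table stats as
  select year, month, count(year) as count
  from (select strftime('%Y', ts, 'unixepoch') as year,
               strftime('%m', ts, 'unixepoch') as month
        from links)
  group by year, month
"""

db = sqlite3.connect(":memory:")
db.execute("create table links (ts integer)")
# Made-up timestamps: 2023-01-06, 2023-01-07 and 2022-07-22 (all UTC)
db.executemany(
    "insert into links (ts) values (?)",
    [(1672963200,), (1673049600,), (1658448000,)],
)
db.execute(ROLLUP)

per_month = db.execute(
    'select year, month, "count" from stats order by year, month'
).fetchall()
per_year = db.execute(
    'select year, sum("count") from stats group by year order by year'
).fetchall()
```

strftime returns text, so the years and months come back as strings like '2023' and '01'.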

Desktop feed readers

2022/08/02 ~ nelsonminar

While working on my Atom feed for the linkblog I was frustrated I had no way to preview a feed. The only reader I knew about was Feedly, the hosted app, and they don’t have a way to say “reload this feed” to easily see changes. I couldn’t find a modern desktop client feed reader at all. Turns out there’s a few; here’s three I tried.

QuiteRSS: simple, no nonsense. Last release was April 2020. There’s a fork with a little more tinkering but it’s not a lively project.
SeaMonkey: the continuation of the old Mozilla hairball of Internet apps. It has a decent feed reader. And active development.
Liferea: a Linux app. Works in WSLg. Actively developed.

I used this to diagnose a formatting issue with my atom feed. My summaries are now HTML but I wasn’t really formatting them as HTML with <p> tags and the like so the images were being inlined in an ugly way. Fixed now.

Screenshot collage

2022/07/30 (updated 2023/07/15) ~ nelsonminar

Now that I have 18,363 screenshots of linkblog web pages, what can I do with them? My first thought was to make a giant zoomable collage but an experiment with the 960 most recent makes me think that would not be very interesting. (This image is 2688×1470; click to embiggen.)

Generated with ./magick montage -geometry 80x45+2+2 *.webp output.webp. (Fun fact; if you forget the output filename imagemagick will happily overwrite the last input file.)

I think the old screenshots could be interesting but not in this random jumbled display.

Doing all 18,000 images would result in a file about 19x bigger. 11 megabytes, and about 12000×6400 pixels big. That’s shrinking the screenshots down from 640×360 to 80×45; a full resolution file of all images would be 64x bigger, so 700MB and about 100,000 x 56,000.
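The size arithmetic can be checked with a few lines of Python. This sketch assumes the +2+2 in the montage geometry pads each tile by 2 pixels on every side, which matches the 2688×1470 output I got for 960 images as a 32×30 grid. The grid_for helper is hypothetical; ImageMagick picks its own tiling.

```python
import math


def montage_size(cols, rows, tile_w=80, tile_h=45, pad=2):
    """Pixel size of a montage grid; each tile gains `pad` pixels per side."""
    return cols * (tile_w + 2 * pad), rows * (tile_h + 2 * pad)


def grid_for(n, aspect=32 / 30):
    """Pick a cols x rows grid holding n tiles with roughly that cols/rows ratio."""
    rows = math.ceil(math.sqrt(n / aspect))
    cols = math.ceil(n / rows)
    return cols, rows


small = montage_size(32, 30)          # the 960 image experiment
cols, rows = grid_for(18363)          # a grid big enough for every screenshot
everything = montage_size(cols, rows)
```

For 18,363 images this picks a 140×132 grid and a montage of 11760×6468 pixels, in line with the rough estimate above.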

Writing an Atom feed in 2022

2022/07/22 (updated 2023/07/15) ~ nelsonminar ~ 1 Comment

I decided to roll my own feed for my linkblog, to put my nice images in with the text. (Pinboard’s RSS 1.0 feed has served me well for many years!) This turns out to be surprisingly difficult and all tainted by the way RSS/Atom hasn’t really gotten any development in 5+ years.

So which; RSS or Atom? Ugh! Happily the answer in 2022 seems to be “literally no one cares”. So I’m going mostly with Atom for old times’ sake. But I ended up coding this with feedgen which can emit both RSS 2.0 and Atom. (RIP, RSS 1.0 and 0.91). There’s still a lot of work in actually creating the feed and populating all the complicated fields.

Useful Atom resources:

W3C friendly description of Atom format, much more readable than the RFC.
W3C validator
RSS feed best practices, recently written and thoughtful
Atom RFC written in the typical opaque RFC style.

One thing I couldn’t find is a nice Atom preview tool to render what the feed might look like. In the old days I’d use a desktop feed reader or maybe my browser to do this but all I know about now is hosted apps like Feedly where it’s a little tricky to get it to reload as I iterate. Oh well.

Boy Atom sure is complicated. I was briefly involved in the creation of the Atom format but I am kind of horrified at what it is. Every single feed is supposed to have a “self link” pointing to itself. Feed entries can too, but it’s not necessary. Entries can have titles, contents, and summaries. Two dates per entry; published and updated. Etc etc. Not all of these are mandatory but sheesh. Also it’s all XML. Still better than RSS 2.0 though. It’s a shame JSON Feed hasn’t caught on although the last thing we need is yet another feed format.
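To make those pieces concrete, here’s a minimal sketch of an Atom feed with a self link, an author, and one entry carrying a title, HTML summary, and both dates. It uses only Python’s standard library rather than feedgen, and the URLs, IDs, and the build_feed helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)


def _el(parent, name, text=None, **attrs):
    """Append a namespaced Atom element to parent."""
    el = ET.SubElement(parent, f"{{{ATOM_NS}}}{name}", attrs)
    el.text = text
    return el


def build_feed(feed_url, title, author, entries):
    """Return an Atom document as a string; entries is a list of dicts."""
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    _el(feed, "id", feed_url)
    _el(feed, "title", title)
    _el(feed, "link", rel="self", href=feed_url)   # the feed's self link
    _el(feed, "updated", max(e["updated"] for e in entries).isoformat())
    _el(_el(feed, "author"), "name", author)
    for e in entries:
        entry = _el(feed, "entry")
        _el(entry, "id", e["id"])
        _el(entry, "title", e["title"])
        # The entry link goes straight to the linked page, not to the blog
        _el(entry, "link", rel="alternate", href=e["url"])
        _el(entry, "published", e["published"].isoformat())
        _el(entry, "updated", e["updated"].isoformat())
        _el(entry, "summary", e["summary_html"], type="html")
    return ET.tostring(feed, encoding="unicode")


when = datetime(2022, 7, 22, tzinfo=timezone.utc)
xml = build_feed(
    "https://example.com/linkblog.atom",
    "Example linkblog",
    "Example Author",
    [{
        "id": "urn:example:0123",
        "title": "Example post",
        "url": "https://example.org/page",
        "published": when,
        "updated": when,
        "summary_html": '<p>Short note</p><img src="https://example.com/p.webp">',
    }],
)
```

With type="html" the summary markup is escaped inside the XML, so feed readers unescape it and render the paragraph and image.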

Update: feed seems to work on my first try! Not sure I’ll promote it though, it’s not really better than the Pinboard feed. The thing I added was an <img> tag in the description with the preview. But both Feedly and Slack are providing preview images even without it, through their own preview systems, and the result is more or less the same. I’m not sure if they’re even using my image tag at all. For about 10% of my links it should allow for a screenshot from my own code for links where Slack/Feedly won’t show an image at all.

Update 2: My last post (ERCOT Dashboard) is a good test; there’s no OpenGraph-style preview image available but I got a good screenshot. Slack just shows no image (ignoring mine). Feedly shows my screenshot though!

Blocked by Akamai

2022/07/21 (updated 2023/11/23) ~ nelsonminar

For the past two days I’ve been taking web previews of all 25,000 links from my linkblog. Akamai seems to have blocked me in retaliation. Requests to Akamai-hosted services like www.justice.gov are giving me an old school unstyled 403 Forbidden.

I assume they think I’m a scraper of some sort. Which I am, but an awfully low key one. I’ve got a single thread downloading web pages one at a time, every 1-10 seconds. Surprised that triggers Akamai’s defenses. It’s not a huge deal for my current project, but I sure hope it goes away once I stop my survey because it’d be awfully annoying to be blocked from 10% of the Internet permanently.

A little curious how they even caught on to me. I imagine user agents; the only place I make an effort to pretend to be a desktop browser is when I download the actual image named in OpenGraph data. (Oddly this fixed a lot of problems; apparently I can download the HTML with some user agents but not the linked image?!)
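A rough sketch of that kind of low key fetching in Python: one request at a time, a random 1-10 second pause between them, and a desktop-style user agent on the request. The user agent string and the helper names are just examples, not what my tools actually send.

```python
import random
import time
import urllib.request

# Example desktop browser string (an assumption; any current browser's would do)
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.5005.40 Safari/537.36"
)


def browser_request(url, user_agent=BROWSER_UA):
    """Build (but do not send) a request that identifies as a desktop browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def polite_delay(low=1.0, high=10.0):
    """Random pause length in seconds, matching the 1-10 second spacing."""
    return random.uniform(low, high)


def fetch_all(urls):
    """Single-threaded loop: download one URL at a time, pausing in between."""
    for url in urls:
        with urllib.request.urlopen(browser_request(url), timeout=30) as resp:
            yield url, resp.read()
        time.sleep(polite_delay())


req = browser_request("https://example.com/preview.png")
```

Nothing is downloaded until fetch_all is iterated. Spoofing the header only gets past the simplest filters.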

I’ve got metascraper set up using undici as the HTTP client and it sends a user agent of undici.

shot-scraper’s default user-agent is Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/102.0.5005.40 Safari/537.36 which looks close to a legit desktop browser but not exactly. It’s possible to reconfigure what shot-scraper uses.

User agent is the dumbest sort of security; it’s trivially easy to spoof, so really you’re only blocking lazy well meaning people like me if you filter on it. But there’s much more aggressive forms of user agent detection. See curl-impersonate which tries to mimic desktop browsers’ SSL behavior to evade more hardcore detection.

Update: 24 hours later and still blocked. I’m mad about this now. If I’d been aggressively scraping one site repeatedly I’d understand. But some lightweight automated screenshotting of sites all over the Internet once every few seconds should not cause me to get blocked forever. I had to set up tinyproxy just to book a damn hotel. Ironically I can’t read Akamai’s own support notes on their bans, since they are all hosted on Akamai.

Update 2: the block slowly expired starting the morning of July 23, 3 days after my screenshotter tool stopped running. It wasn’t lifted all at once. justice.gov was the first site I noticed working again. Later tripadvisor.com. schwab.com took the longest and for all I know some Akamai sites are still blocking me. Schwab showed a really irritating failure mode; the main page loaded fine (hosted elsewhere?) but the login iframe was blocked. At least that showed an error; I also hit some sites where some invisible AJAX refused to load, invisibly.

Update 3: if this happens again, try the tool at https://www.akamai.com/us/en/clientrep-lookup/

Tech notes on my linkblog

2022/07/18 (updated 2022/07/21) ~ nelsonminar

My linkblog website is done, see it here: https://www.somebits.com/linkblog/

After a week or two of tinkering (and more research) I’ve come up with a nice clean web page for my linkblog. Each post has its own box with an image preview. I’m pretty excited how it looks. Also reminded myself of a cardinal rule of design projects; get something that looks good and functional ASAP. It’s so motivating!

Here’s screenshots of design iterations (from newest to oldest).

It’s a pretty straightforward layout but there’s one clever thing I’m doing, highlighting positive sentiment posts (white) from negative (black). I like how I iterated from a simple vertical design to something with alternating left and right. It’s idiosyncratic but I think it’s interesting.

Code architecture

It’s all pretty simple. Custom built Python static site generator driven from Pinboard data making liberal use of external Unix tools for complicated things. Very basic hand coded HTML and CSS, no frameworks, no Javascript. Static sites for static data!

The main loop is to sync data from Pinboard once an hour into sqlite. Then run a process to generate image previews for new links. Finally render HTML from the data in sqlite and push to a web server.

I used my friend Dan’s PugSQL for database access in Python. Very nice little tool; you write actual SQL code with just the lightest sprinkling of metadata for variable names, then execute it in your Python code. It takes some of the hassle out of writing SQL in Python without doing anything too magic. Worked great for this project. I only have two tables in SQLite and a total of 7 SQL queries. One nice thing about sqlite is there’s less need to optimize; I’m happy to make several SQL calls to render one post rather than try to do some complicated join to minimize database round trips.

Previews

I’ve written several blog posts about preview generation and I’m glad for all the research I did. I settled on a single 320 pixel wide image for a preview, height up to 320 with a preferred height of 180. My code supports multiple engines for generating an image preview but in the end I’m only using two. Metascraper, a standalone program that analyzes HTML for OpenGraph tags, etc to select an image. And shot-scraper, a standalone program that takes screenshots with a headless browser. I also tried the linkpreview service and a Python library called webpreview but they were redundant with metascraper.

Multiple engines yields multiple candidate images. I pretty much always take the Metascraper image; it’s available for about 90% of the links. In a few hand-coded exceptions I’ll prefer the screenshot. (Metascraper offers crappy images for Hacker News posts and Wikipedia pages that don’t have featured images. Also sometimes it returns an SVG image which the rest of my code can’t handle.)
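The selection rule can be sketched as a small function. The exception list here only has Hacker News in it and detects SVG by file extension; both are simplifying assumptions, and the names are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical exception list: hosts where the screenshot is preferred
PREFER_SCREENSHOT_HOSTS = {"news.ycombinator.com"}


def choose_preview(link_url, metascraper_image, screenshot):
    """Pick one candidate image for a link, or None if there is nothing usable."""
    host = urlparse(link_url).hostname or ""
    # SVG images can't be handled downstream, so treat them as unusable
    usable = bool(metascraper_image) and not (
        urlparse(metascraper_image).path.lower().endswith(".svg")
    )
    if usable and host not in PREFER_SCREENSHOT_HOSTS:
        return metascraper_image
    if screenshot:
        return screenshot
    return metascraper_image if usable else None


picked = choose_preview(
    "https://example.com/story",
    "https://example.com/og.jpg",
    "shots/story.png",
)
```

The Metascraper image wins by default; the screenshot is used only for listed hosts, for SVGs, or when no image was found.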

Overall I’d say 90% or more of my posts have good images. A few of the Metascraper images are turkeys. Some screenshots are marred by cookie popups, etc. One neat thing is that web pages that don’t screenshot well also tend to be the ones with thoughtful OpenGraph previews.

The preview images are downloaded or generated and resized to 640 wide with lossy cwebp. Major savings there; the results are maybe 10% the size compared to just serving the actual image. Screenshots average 27kb, metascraper’s website images average 50kb with a few oddball much larger ones because I couldn’t resize the source image. (cwebp can’t deal with animated GIFs!)

Page size and performance

I’m going for a fairly maximalist presentation so I don’t care too much about page size. 100 links on a page makes for a 5MB download, almost entirely the image previews. Google’s PageSpeed Insights gives me a pass on “core web vitals” and an 85 on Performance, which isn’t awful. The main complaint is just that it’s a 5 megabyte page with 100 medium sized images on it. I could easily solve this by just including fewer links on the page 👿. Or maybe go to an infinite scrolling / load on demand design, but the extra complexity of that does not seem worth it to me.

The one performance thing I’m not happy with is the image reflow. I don’t set an explicit height for the images because I’m using CSS to calculate the height. I’m not cropping the images but relying on overflow: hidden to contain them to max-height 320px. This all looks good, is less work for me, and has the nice property that the full preview image is available if you click on it. But it does cause a lot of reflowing while the page loads. I should revisit this.

Update: Thanks to a hint from Thomas S I am now including the native image size in the HTML img tags. Combined with CSS rules for max-width: 100% and height: auto, the browser does a nice job laying out the images before loading them. I also added loading="lazy" to the image tags at his suggestion which is a big help. The new mystery is font loading; Firefox loads them after the visible images. But that’s a deep rabbit hole.
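In a Python site generator that fix amounts to rendering something like the following. The img_tag helper and the file path are made up; the width, height, and loading attributes plus the CSS rule are the parts that matter.

```python
from html import escape

# CSS that pairs with the tag: scale down to fit, keep the aspect ratio
CSS_RULE = "img { max-width: 100%; height: auto; }"


def img_tag(src, width, height, alt=""):
    """Render an <img> with its native pixel size and lazy loading."""
    return (
        f'<img src="{escape(src, quote=True)}" '
        f'width="{int(width)}" height="{int(height)}" '
        f'alt="{escape(alt, quote=True)}" loading="lazy">'
    )


tag = img_tag("previews/abc.webp", 640, 360)
```

Because the browser knows the aspect ratio from width and height, it can reserve the right amount of space before the image arrives, which is what stops the reflow.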

Responsive design

This is the first time I’ve coded a mobile-friendly view for a website with a responsive design. All by hand, no framework to help me. My design is intended for desktop use, I worked with an 800px wide frame as the core design element. 320px for images, about 440px for text, and some gutters. But I wanted it to look reasonable on a phone too. So I read a couple of MDN articles and figured out how to do responsive design. It’s simpler than I realized.

The first principle is that you need to set <meta name="viewport">. Without it mobile browsers seem to go into some compatibility mode where they render the page for a screen that’s 960px wide and then shrink it to fit in the actual CSS pixels for the screen, often about half or a third the size. That’s why fonts are so tiny for non-mobile websites on phones! So you set the viewport to width=device-width and now the page will render at the phone’s native CSS width, typically around 400 CSS pixels. (Which is probably actually 800 or 1200 physical pixels thanks to high DPI.)

The second principle is just to design the site so the page looks good at various widths in a regular desktop browser. I originally hardcoded a rigid width: 800px in my main display element which looks terrible if you shrink the browser below that (it clips). Making a more flexible layout works better. I was a bit constrained because I really wanted the preview images to be fixed at 320px, but now the text could go anywhere from 440px to about 250px and still look good.

The third principle is media queries to define explicit CSS rules for different screen sizes. My main rule is if the screen is smaller than 800 pixels to render differently; switch to a single column layout (images below text), use smaller margins, etc. For smaller mobile size screens I knock the font size down a bit too, to make more text fit. I did the opposite of the recommended “mobile first design”; my default CSS rules are for wide desktop screens, then I have overrides for small mobile screens.

Firefox has a great “responsive design mode” that lets you simulate how your page looks on various real world phones. Big help.

Future work

I’m pretty happy with where this stands. But I’m sure I’ll tinker.

One future work idea is to do some sort of pagination / archive view. Right now I’m just showing the last 100 links, without even dates displayed. That’s fine, there’s an archive view on Pinboard. But maybe I’ll eventually do more on my static site. See notes above about infinite scrolling as a possibility, too.

The other thing to tweak is to keep improving the previews. Metascraper could be improved. Also I realized I could use Feedly to generate previews for me; they have an API where I can pull their preview images. There’s diminishing returns for this kind of work.

Finally I have one unexplored design idea. Display each preview as a full-bleed image in the background of a box for the post. Then put my descriptive text on top of the preview image, using some combination of blur and shading and maybe text halos to make it readable. It’s a pretty aggressive design but I’ve seen stuff like that look good before and it’d be fun to try.

WebP conversion

2022/07/16 (updated 2022/07/20) ~ nelsonminar

I just spent about an hour learning about converting images to WebP. For my linkblog; I want my link previews to be small files. Target display is 320x180ish images.

Here’s where I landed to resize everything to 640 pixels wide with a fairly high quality:

cwebp -quiet \
  -af -q 90 \
  -resize 640 0 \
  -metadata all

These are guesses but seem to work about right.

-af -q 90 are the quality settings. 90 is fairly high, the default is 75. -af means “spend more time making it look good”, it seems to take about 2x as long. I also experimented with the -m setting to trade off time vs compression quality but the default -m 4 seems fine to me.

The resize forces all images to a width of 640 and whatever height is natural. Twice the target display resolution for retina display.
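Driving this from Python could look like the sketch below. The flags are the ones described above; the input file and the -o output file are the parts the bare command leaves out, and the helper names and file paths are made up.

```python
import subprocess


def cwebp_command(src, dest, width=640, quality=90):
    """Build the cwebp argument list for one image."""
    return [
        "cwebp", "-quiet",
        "-af", "-q", str(quality),
        "-resize", str(width), "0",   # height 0 keeps the aspect ratio
        "-metadata", "all",
        src, "-o", dest,
    ]


def convert(src, dest):
    """Run cwebp; raises CalledProcessError if it exits non-zero."""
    subprocess.run(cwebp_command(src, dest), check=True)


cmd = cwebp_command("shots/abc.png", "previews/abc.webp")
```

Passing the arguments as a list avoids shell quoting problems with odd filenames.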

The one surprise is some of the images (but only some) seem to get a little brighter after conversion. I thought that was a symptom of ICC color profiles not being applied right, which is why I added the -metadata all flag. But it’s still happening. I don’t care enough to figure it out; it could just be the difference in the browser resize algorithm vs cwebp.

Starting with a bunch of 1280×720 screenshot PNGs the webp versions are about 10% the size. If I leave out the image resizing it’s about 25%. That’s mostly thanks to the lossy compression which is fine for my purposes given I’m resizing too.

Overall this got my full page down from 24MB to 2MB. Worth the time!

Website previews: images for unfurls

2022/07/08 (updated 2022/07/20) ~ nelsonminar ~ 1 Comment

Some notes on another way to generate one image per link for my linkblog: unfurls. Those are the site previews things like Twitter Cards or Slack Unfurls or Facebook Link Previews. They try to extract a few words and an image from the page; the resulting preview is structured data, not just a picture. Most of these basically work by looking for oEmbed tags, OpenGraph tags, or Twitter’s card tags and then just guessing if those metadata tags aren’t present.

Most unfurls are a mix of text, a little structured data, and an image or video embed. I’m focused on the image. Possibly I could generate an image from textual data too.

Page metadata: OpenGraph and friends

Facebook, Twitter, Slack, etc rely primarily on page metadata to get a text summary and example image. Only works for sites that publish the metadata but since it’s so commonly used now a lot of sites have it. These formats all offer one image. By custom that image tends to be 400+ pixels wide and roughly 2:1 aspect ratio.

OpenGraph is the big standard, originally from Facebook, here’s a useful description of how it works in practice. Dates to 2010. For my purposes the key tag is og:image. og:title gives text.

Twitter’s cards are a popular expansion of the OpenGraph idea dating to 2012. More details on supported tags here. twitter:image is the most relevant, and possibly twitter:player. There’s more tags for textual metadata than OpenGraph.
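As a rough illustration of how little is needed for the simple case, here’s a sketch that pulls og:image and twitter:image out of a page with Python’s standard HTML parser. Preferring og:image over twitter:image is my arbitrary choice here, and real pages have many more edge cases (relative URLs, multiple images, broken markup).

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect og:* and twitter:* meta tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # OpenGraph uses property=, Twitter cards usually use name=
        key = a.get("property") or a.get("name")
        if key and key.startswith(("og:", "twitter:")) and a.get("content"):
            self.meta.setdefault(key, a["content"])   # keep the first value seen


def preview_image(html):
    """Return the og:image URL, else twitter:image, else None."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta.get("og:image") or parser.meta.get("twitter:image")


page = (
    '<html><head><meta property="og:title" content="Example">'
    '<meta property="og:image" content="https://example.com/og.jpg">'
    '<meta name="twitter:image" content="https://example.com/tw.jpg">'
    "</head><body></body></html>"
)
image = preview_image(page)
```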

oEmbed is the oldest standard, from 2008. Metadata isn’t in the page itself, instead you have to make a query to a special JSON endpoint. Interesting image fields are thumbnail_url and maybe the photo and video types. Lots of textual metadata too. I don’t know how popular oEmbed is these days but Slack supports it.

schema.org Articles also are relevant, particularly since Google supports them. It started in 2011 as a way to tell search engines how to summarize a page. Metadata is in the page in one of three encodings (sigh). There’s also an overwhelming number of tags; maybe thumbnailUrl is what I want? or image? Honestly this is all so complicated I’m not in a hurry to learn more.

Without metadata

Metadata is great if it’s present, what do you do if it’s not? Maybe pull an image from the page? This approach doesn’t seem so popular but Feedly does it. There’s a description in item 4 of what they do. Boils down to images tagged webfeedsFeaturedVisual, or else the first big image in the story, or else the biggest image on the page.
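A loose sketch of that fallback order in Python. It assumes the images have already been extracted from the page, in document order, with their sizes and CSS classes. The size threshold for “big” is invented, and it doesn’t distinguish the story from the rest of the page the way Feedly’s description does.

```python
def pick_page_image(images, big=(400, 200)):
    """images: list of dicts with src, width, height and classes, in page order."""
    # 1. An image explicitly tagged for feed readers
    for img in images:
        if "webfeedsFeaturedVisual" in img.get("classes", ()):
            return img["src"]
    # 2. The first image at least `big` in both dimensions (threshold is a guess)
    for img in images:
        if img["width"] >= big[0] and img["height"] >= big[1]:
            return img["src"]
    # 3. Otherwise the largest image by area, if there is any image at all
    if images:
        return max(images, key=lambda i: i["width"] * i["height"])["src"]
    return None


candidates = [
    {"src": "icon.png", "width": 32, "height": 32, "classes": ()},
    {"src": "hero.jpg", "width": 1200, "height": 600, "classes": ()},
    {"src": "chosen.jpg", "width": 300, "height": 150,
     "classes": ("webfeedsFeaturedVisual",)},
]
chosen = pick_page_image(candidates)
```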

Another option would be to take a screenshot of the page itself. I have some notes on that but it’s not very simple to get a good result.

A third option would be to use the favicon. Not awesome, but it is there.

Heuristics for combining preview data

So now we have myriad ways to find an image for the site; which one do we use? Slack has a great post (from my friend Matt!) about how they generate unfurls. They give oEmbed priority, then Twitter+OpenGraph, then HTML meta tags as a fallback. They also seem to combine data from multiple sources; their cards have room for a lot of text.
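A priority order like that boils down to “first source that has an image wins”. Here’s a sketch in Python; the tiers follow the order in Slack’s post, but the ordering of Twitter before OpenGraph within the middle tier and the shape of the inputs are my assumptions.

```python
def first_image(oembed, cards, fallback_images):
    """Return (source, image_url) from the highest priority source, or None.

    oembed: dict parsed from an oEmbed JSON response (may be empty)
    cards: dict of twitter:* and og:* meta tag values (may be empty)
    fallback_images: image URLs found some other way, e.g. plain HTML tags
    """
    candidates = [
        ("oembed", oembed.get("thumbnail_url")),
        ("twitter", cards.get("twitter:image")),
        ("opengraph", cards.get("og:image")),
        ("html", fallback_images[0] if fallback_images else None),
    ]
    for source, url in candidates:
        if url:
            return source, url
    return None


result = first_image(
    {},
    {"og:image": "https://example.com/og.jpg"},
    ["https://example.com/first.png"],
)
```

Returning the source name along with the URL makes it easy to survey which kind of metadata a collection of links actually has.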

Tyler Young from Felt recently tweeted about their solution. Light on details though, mostly it’s just “it’s complicated”. That echoes what I’ve heard informally from folks who’ve worked on this problem at various companies. Lots of one-off hacks in the end.

Code Tools

All this stuff is complicated and of general interest; is there a reusable library for generating previews I can just use? npm.io has a big list, here’s some highlights from it and some others.

metascraper (GitHub) looks to be the most active of the NPMs. MIT license, active development. Has a lot of configurable options and also custom code for popular sites. This looks like a strong contender I should evaluate further.

iFramely (GitHub) is mostly a hosted service but their parser is on GitHub with an MIT License. Javascript, looks like fairly active development. Also promising.

unfurl (GitHub). TypeScript, active development, MIT license.

Link Preview (GitHub) generates OpenGraph, TwitterCard, and oEmbed previews of pages. The source is Javascript and MIT licensed. Last updates about 2 years ago. Looks like a small project.

pyUnfurl (GitHub) does the various metadata things and falls back to favicon. Python, MIT License, small project with last main development 3 years ago.

extruct is in Python (BSD license) and recently updated. OpenGraph but not Twitter (yet, there’s a pull request).

webpreview is in Python (MIT license) and was last updated two years ago. OpenGraph, Twitter, Schema, or else “from the webpage’s content”.

Metaphor is the Python code with the most Google juice but hasn’t been updated in 5 years.

Buying a service

Another option for unfurls is just buying a blackbox service for this. Embed.ly is the one I know; part of the value add is they have “700+ Official Content Providers” who they’ve worked to interoperate with. Their $9/mo product wants to embed “cards” though, which I take to mean Javascript running on your site. Not for me. The $99/mo product is a lot more flexible but is more than I’d want to spend. iFramely also looks promising although similar problem with pricing / embeds.

Update: Ryan B mentioned using linkpreview.net, it looks promising. It has a JSON API that returns data including an image. The free plan might work for me or paid plans start at $8/mo.

It’d also be possible for me to do a bit of guerilla scraping for my own “service”. Post the link to Feedly or Twitter, then capture what preview they come up with. That might actually work pretty well, at least right until it doesn’t.

My Plan

I should start with Metascraper. It looks good, my only complaint is dealing with Node. Alternately I think I’m stuck writing my own thing, a lightweight OpenGraph / Twitter card scraper plus reimplementing some image choosing heuristics. That’s a lot of work I’d rather not do.

Once I have an unfurl tool I can use I should test it on my linkblog. I’m really curious how many of the links I’ve posted have useful metadata. I’d guess about half of them in the last 3 years. Seems worth doing a survey.

Screenshots may still prove to be a useful fallback.

Generating screenshots from web pages

2022/07/08 (updated 2022/07/20) ~ nelsonminar ~ 2 Comments

I’m working on a way to get one image for a link for each link in my linkblog. There’s two basic approaches: screenshots and unfurls. This post has my notes on screenshots, particularly Simon Willison’s shot-scraper tool based on Microsoft Playwright.

Screenshots boil down to running something like a normal browser and capturing the page after it renders. Sounds simple but the modern web with ads, cookie popups, etc makes this hard. The problem is getting a screenshot that shows the intended content. I tested about 50 links of mine with shot-scraper and maybe 40% didn’t work at all, or had a paywall notice or a cookie consent popup covering everything or otherwise didn’t show anything useful.

One straightforward solution would be just to take screenshots manually when adding a new link. I already like what I’m seeing when I choose to blog a link, why not take a picture? I don’t want to create new work in my flow of posting to a linkblog but honestly this wouldn’t be too much, particularly with a browser tool automating the screenshot process. A related solution would be to pay someone else to do the screenshot for you via Mechanical Turk or the like. But at my small scale I don’t think that makes sense.

shot-scraper is a promising new tool for automated screenshots. It’s a wrapper around Microsoft Playwright, an automation framework that wraps up browsers like Chrome or Firefox and makes it easily programmable. Simon’s shot-scraper puts a nice command line interface around it and while it’s written in Python, it’s more of a command line tool than a Python API. It’s fairly subtle and allows for extracting specific CSS pieces, running some Javascript in the page context, etc. I’ve had promising results with it so far but it needs some tweaking to get the more recalcitrant pages to display.

Ben Welsh’s News Homepages project is open source and also uses Playwright to take screenshots of newspaper pages. The clever thing there is a bunch of extra rules to make the 200 sites they care about look better. There are generic rules like “don’t show anything named popup_wrapper”. Also site-specific rules that boil down to hiding specific CSS classes or running a bit of extra code to adjust the page layout. The bespoke approach won’t work for me since I don’t have a short target list but some of the generic rules might work.
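A generic rule like that could be tried with shot-scraper by passing a bit of Javascript to run before the screenshot. This Python sketch only builds the command line. The selector is a guess at what “named popup_wrapper” means, the 1280×720 size is just an example, and the helper name is made up.

```python
import subprocess

# Javascript run in the page before the screenshot (selector is an assumption)
HIDE_POPUPS_JS = (
    "document.querySelectorAll('#popup_wrapper, .popup_wrapper')"
    ".forEach(el => el.remove());"
)


def shot_scraper_command(url, output, width=1280, height=720, user_agent=None):
    """Build a shot-scraper command that strips popup wrappers first."""
    cmd = [
        "shot-scraper", url,
        "-o", output,
        "--width", str(width),
        "--height", str(height),
        "--javascript", HIDE_POPUPS_JS,
    ]
    if user_agent:
        cmd += ["--user-agent", user_agent]
    return cmd


def take_screenshot(url, output):
    """Run shot-scraper; raises CalledProcessError if it fails."""
    subprocess.run(shot_scraper_command(url, output), check=True)


cmd = shot_scraper_command("https://example.com/", "example.png")
```

Site-specific rules would just be more Javascript keyed by hostname, which is where the bespoke work piles up.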

I wonder if it’s possible to install an ad blocker or other addons inside Playwright’s browser? (Simon thought maybe in the Javascript version, but not Python). That might be a good general purpose way to improve the screenshot quality.

Another tweak is to the user agent you send when taking the screenshot. Looking like a bot often gets you blocked so emulating a browser may be necessary. I have a suspicion that the screenshots from mobile versions of sites are likely to be better than desktop ones. Although Tyler warns that different sites work better with different user agents.

There are some web services that do screenshots that might work well: Site-Shot ($15/mo for 5000 screenshots), url2png ($29/mo for 5000), urlbox ($99/mo for 20,000), or the old Thumbshots (since shut down). Update: SavePage.io (free for 7000, $25/mo for 500k). A bunch more listed in urlbox alternatives. (Pikwy looked promising but didn’t work for me.) One thing to check with these is how easy it is to store the screenshot yourself; some seem oriented to them hosting the screenshot and charging per-view.

Overall I’m not very excited about screenshots and will probably not pursue this further for my linkblog. Unfurls / site previews seem a better path.

Nelson's log

A personal work journal

Linkblog