Whois is one of the magical old Internet protocols, going way back to 1982 in the time before TCP was even the standard. It still mostly works. I was curious how it’s scaled, in particular where the database state is held. The Wikipedia article is useful.
Long story short, each top level domain like .com has a whois server. Some of those servers are “thick”: they hold the full whois record for every domain under the TLD. I.e., the .ORG whois server has the full .ORG whois database itself. Others are “thin” and delegate queries to the domain registrar of record; for somebits.com, say, the .COM whois server looks up the name, sees TUCOWS is the registrar, and points the client at the TUCOWS whois server for the rest of the somebits info. It’s all basically a pretty centralized system and seems to work. RWhois is a proposal to replace the centralized system with a hierarchical one.
There are about 100 million .com domains. Assuming a generous 4 kilobytes of data per record, that’s 400 GB. That’s a fair amount of data but not a whole lot, although there sure must be a lot of queries. Whois updates propagate fast, often in seconds, although TUCOWS has a 15 minute delay.
You can see this delegation when making a whois query (testing with the Mac command line client). Asking for somebits.com, the first query goes to a Verisign server, which gives back a partial response that includes a Whois Server at TUCOWS, the registrar for the domain. Then the whois client makes a second request to TUCOWS. Asking for somebits.org gives you the reply straight from a server at afilias.info, which I guess is running the whois for .org, a thick implementation.
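Here’s a rough Python sketch of what the client seems to be doing: open a TCP connection to port 43, send the name followed by CRLF, read until the server hangs up, then look for a referral line and ask again. The function names and the field labels in the regex (“Whois Server” and “Registrar WHOIS Server”) are my assumptions about the thin-server reply format, not something taken from the Mac client’s source.

```python
import re
import socket


def whois_query(server, query, port=43, timeout=10.0):
    """Send one whois query and return the whole text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        # The protocol is just the query text terminated by CRLF.
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Assumed labels for the referral line in a thin server's reply.
REFERRAL_RE = re.compile(
    r"^[ \t]*(?:Registrar WHOIS Server|Whois Server):[ \t]*(\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def find_referral(reply):
    """Return the referred whois host in lowercase, or None if there is none."""
    match = REFERRAL_RE.search(reply)
    return match.group(1).lower() if match else None


def lookup(domain, first_server):
    """Query first_server, then follow at most one referral (the thin case)."""
    reply = whois_query(first_server, domain)
    referral = find_referral(reply)
    if referral and referral != first_server.lower():
        return whois_query(referral, domain)
    return reply  # thick case: the first server already had everything


# Example (needs network access):
# print(lookup("somebits.com", "whois.verisign-grs.com"))
```

A thick server’s reply has no referral line, so `lookup` just returns the first answer.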
How do you find the whois server for a particular TLD? There’s no standard, but IANA runs a database you can query via whois. I.e., whois -h whois.iana.org com. I don’t know if the whois tools look that up online. My Mac client seems to know directly where to send whois queries for weird domains like .museum, .xxx, and .io. (Although it has a bunch of hilarious options to manually choose whois servers, like Russian and Caribbean authorities.)
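The same IANA lookup is easy to sketch in Python. This assumes the IANA reply lists the TLD’s server on a line starting with “whois:”, which is what I see in the output; the function names are made up for illustration.

```python
import socket

IANA_SERVER = "whois.iana.org"


def whois_query(server, query, port=43, timeout=10.0):
    """Send one whois query and return the whole text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_iana_whois_server(reply):
    """Pull the value of the first 'whois:' field out of an IANA reply."""
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "whois":
            return value.strip() or None
    return None  # IANA lists no whois server for this TLD


def whois_server_for_tld(tld):
    """Ask IANA which whois server handles a TLD such as 'com' or '.museum'."""
    return parse_iana_whois_server(whois_query(IANA_SERVER, tld.lstrip(".")))


# Example (needs network access):
# print(whois_server_for_tld("com"))
```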
The most important thing a whois server tells you is the DNS authority for the domain, where to go to look up IP addresses. In retrospect it’s odd that this query is not part of the DNS system itself and is instead delegated to this other protocol. Then again, the typical DNS client never makes whois requests itself; it just asks a DNS resolver somewhere. And DNS has a huge amount of caching.
I have a lot of respect for the core old Internet protocols. Many have been in service for 30 years and still work great, despite a zillion-fold increase in Internet size.