Identify Reddit deplorables

Interesting new Reddit tool: Masstagger. You install it and it pops up little red warnings next to users’ posts: “the_donald user”, “kotakuinaction user”, or the like. A quick way to get some insight into a Redditor’s history and reputation. Makes it easy to identify the Nazi-wannabes, at least.

More about it in this Reddit discussion. I particularly like the author’s responses to the kind of crap these projects always attract. “Why not open source? … Because I don’t want to”. “This is just like giving Jews yellow stars! … No, not really.” “Can I add my own subreddits to tag to meet my own personal desires? … No, this identifies Nazis. My old tool was editable and people used it to stalk porn posters.”

Behind the scenes it works like this: they have a list of deplorable subreddits (104 right now) that they monitor. The backend server constantly downloads posts to those subreddits and keeps statistics on which users post there. A second service lets you look up the scores for a list of usernames. That’s what the browser addon uses; when you load a Reddit page it fetches the scores and annotates accordingly.
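Roughly, I imagine the bookkeeping looks something like this. This is a sketch of my own, not their code; the subreddit names, data shapes, and function names are all made up.

# Hypothetical sketch of Masstagger-style bookkeeping, not the real implementation.
# Assumes a stream of (username, subreddit) pairs scraped from Reddit.
from collections import defaultdict

MONITORED = {"the_donald", "kotakuinaction"}   # stand-in for the real list of ~104 subreddits

# username -> {subreddit: post count}
post_counts = defaultdict(lambda: defaultdict(int))

def record_post(username, subreddit):
    """Called by the scraper for every new post seen in a monitored subreddit."""
    if subreddit in MONITORED:
        post_counts[username][subreddit] += 1

def lookup(usernames):
    """What the second service does: return tag data for a list of usernames.
    The addon calls this once per page load and annotates matching posts."""
    return {u: dict(post_counts[u]) for u in usernames if u in post_counts}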

They had some scaling problems today; unfortunately the service dynamically generates the statistics data when users ask. I was thinking they could just do things statically, generating a statistics file once an hour for the addon to download. But tracking 100,000 users over 100 subreddits is 10M records, or maybe 200MB of static data. That’s a lot to serve in a single file.
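Back of the envelope, with 20 bytes per record as my guess at a compact encoding:

# Rough sizing of a fully static dump. The 20 bytes per record is an assumption.
users = 100_000
subreddits = 100
bytes_per_record = 20                      # guess: user id + subreddit id + count
records = users * subreddits               # 10,000,000
print(records, "records,", records * bytes_per_record / 1e6, "MB")
# -> 10000000 records, 200.0 MB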

There’s a variety of existing “profile Reddit users” sites; see SnoopSnoo, Reddit User Analyser, and Reddit Investigator. I wonder if any of them have a backend suitable for this use? Reddit User Analyser works by fetching comments from Reddit directly in the browser page; no server, so probably too heavyweight for this addon. SnoopSnoo appears to have a database on the backend, the report pages come back with little bits of data injected as scripts in the HTML source. Reddit Investigator is down right now.

Anyway it’d be pretty simple to build a custom service for this. Less clear how hard it’d be to make it scale. Static files are clearly the best choice, but it’s a lot of data. Maybe one static file per profiled user? That would require the addon to fetch something like 40 static files with each page load, which is not great but not awful.
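Something like this on the addon side, if the data were published as one little JSON file per user. The URL scheme and JSON shape here are invented for illustration.

# Hypothetical addon-side fetch against a static one-file-per-user layout.
import json
from urllib.request import urlopen

BASE = "https://example.com/masstagger-data"   # assumption: static files served from here

def fetch_user(username):
    """Fetch a precomputed per-user score file, e.g. .../u/<name>.json."""
    try:
        with urlopen(f"{BASE}/u/{username}.json", timeout=2) as resp:
            return json.load(resp)             # e.g. {"the_donald": 57}
    except OSError:
        return None                            # 404 or network error: user isn't tagged

def annotate(usernames):
    """One small request per visible commenter; about 40 fetches per page load."""
    results = {}
    for u in usernames:
        data = fetch_user(u)
        if data:
            results[u] = data
    return results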

 

Environment variables and secrets

People put secrets in environment variables all the time. AWS keys, database credentials, etc.

But in Ye Olden Days environment variables were not secrets. At least on Ultrix 2.2, the BSD 4.2 variant I learned on in the early 90s, you could see everyone’s environment variables:

Note that denying other users read permission [to your .profile] does not mean that they cannot find your PATH or any other of your environment variables. The -eaxww options to the ps command display in wide format the environment variables for all processes on the system

Linux explicitly does not do this. Environment variables are only readable by the process’s own user and by root; ps gets them from /proc/<pid>/environ, which is restricted. Now I’m curious about the history of environment privacy. I don’t know if Linux always treated environment variables as secrets, or what other Unix systems do or did. It seems like an interesting change in behavior. I’m not clear on what POSIX says either.
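A quick way to see the modern Linux behavior for yourself:

# Try to read the environment of every process on a Linux box via /proc.
# You should only succeed for your own processes (or all of them, if root).
import os

def readable_environs():
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/environ", "rb") as f:
                variables = [v for v in f.read().split(b"\0") if v]
            yield int(pid), len(variables)
        except PermissionError:
            pass   # someone else's process: Linux hides its environment
        except (FileNotFoundError, ProcessLookupError):
            pass   # process exited while we were looking

for pid, count in readable_environs():
    print(pid, count, "environment variables readable")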

BTW there are many articles on the Web that say it’s dangerous to put secrets in environment variables. (Example.) They may only be readable by the process user, but they can easily leak into debug dumps, logs, child processes, etc. OTOH I see a lot of people using the environment for secrets now, so maybe standards are changing.

See also this discussion.

Twitter robot purge

Twitter purged a bunch of robot accounts this week as some sort of effort to clean up their platform. You can see the result for your own account by logging in to https://analytics.twitter.com/; you’ll see the falloff.

My personal account @nelson lost about 5% of followers. My account is pretty ordinary other than it’s old.

I have an account with a much simpler set of followers though: my robot account @somebitslinks. That’s an automated account posting interesting links I place on Pinboard. It’s not social in any way. But it got 10,000 followers all in one day in January 2012 when it was listed for a few hours as a recommended account on Twitter’s home page.

Over the years that account has lost followers, I imagine as people got fed up with all the robot spam in their timelines. It slowly dropped from 10,000 to 9000. This week in the purge it dropped from 8947 to 8087, or just about 10%.

I don’t know what to conclude from that exactly, other than that Twitter’s bot-finding algorithm apparently identified about 10% of those January 2012 followers as bots. Makes me wonder what the real number is.

Update: some context in this tweet, showing big accounts losing anywhere from 2% to 78% of followers. Also this NYT summary.

People visit roughly 25 places

Interesting study summarized over at The Economist. The researchers tracked the movements of 40,000 people as they went about their daily lives. They found the number of places that people go to regularly is about 25. That set of 25 places changes, but when someone adds a new place they tend to stop going to the old place. The result is kind of like a Dunbar’s number but for places, not relationships.

The article is a good summary, but the paper, Evidence for a conserved quantity in human mobility, is of course more detailed. If you ask a friendly librarian in Taiwan (speaking Russian when asking) they might give you this download link.

To be honest I didn’t get a lot more out of the paper than the Economist article. The statistical methods are unfamiliar to me and I’m too lazy to figure them out. But some details:

  • They define a “place” as anywhere someone dwells more than 10 minutes. These are characterized as “places offering commercial activities, metro stations, classrooms and other areas within the University campus”
  • People discover new places all the time. The fit is a power law, roughly
    locations = days ^ 0.7 over a span of ~1000 days. (There’s a quick sketch of that curve after this list.)
  • The probability a new place becomes part of the permanent set is somewhere between 7% and 20%. The Lifelog dataset (their largest) yields 7%; the others are 15-20%.
  • There are four separate datasets. Sony Lifelog is the big one; that’s like Google Timeline combined with a fitness tracker. But there are also several academic datasets. One of those, the Reality Mining Dataset from the MIT Media Lab, is publicly available and covers 94 people.
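A quick sketch of what that discovery curve implies. The 0.7 exponent is from the paper; the constant here is arbitrary, just to show the shape.

# Sublinear power-law discovery: locations grow roughly like days ** 0.7.
def discovered_locations(days, exponent=0.7, constant=1.0):
    return constant * days ** exponent

for days in (10, 100, 1000):
    print(days, "days ->", round(discovered_locations(days)), "places discovered")
# 10 days -> 5, 100 days -> 25, 1000 days -> 126: discovery keeps slowing down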

Interesting research. I wonder if it’s really true? It seems plausible enough and matches my personal experience. Particularly since I split time between two cities; I go to fewer places in San Francisco regularly now that I am half time in Grass Valley.

 

Fixing bufferbloat in Ubiquiti EdgeOS

This Hacker News discussion got me to dive in and enable smart queuing in my Ubiquiti EdgeMAX routers, the ones running EdgeOS. There’s a quick-and-dirty explanation of how to set it up in these release notes; search for “smart queue”.

Long story short, under the QOS tab I created a new policy for eth0 and set bandwidth numbers a little higher than the 100/5Mbps my ISP says it sells me. Then I tested with the DSLReports speed test:

  • No smart queue: I get about 105/5.5 with an F for bufferbloat, 300ms+ of added latency under load.
  • Smart queue at 100/5: I get 90/4.5 with an A+ for bufferbloat.
  • Smart queue at 110/6: I get about 100/5.3 with an A for bufferbloat.

Update: several folks have pointed out to me that smart queuing causes problems if you have a gigabit Internet connection. The CPU in a Ubiquiti router can only shape 80-400Mbps of traffic, depending on how new/expensive a router you bought. If you enable smart queuing on a gigabit Internet connection you will probably lose a lot of bandwidth.

The docs note that connections are throttled to 95% of the maximums you set, which probably explains that 90/4.5 reading. I think the only harm in lying a little here (setting 110/6) is that I might get a few ms of lag from buffers.

The Hacker News discussion has a bunch of other stuff in it too. Apparently “Cake” is the new hotness in traffic shaping and can be added to EdgeOS, though it’s awkward. Also it’s apparently hard to buy a consumer router that can really do gigabit speeds, particularly if you want traffic shaping. Huh.

Every single time I’ve enabled QoS I end up regretting it because it breaks something I don’t figure out until months later. I wonder what it will be this time? I hate that I have to statically configure the bandwidth throttles.

 


UDP spam from DirecTV boxes

I was watching my new Linux server’s bandwidth graphs closely and noticed a steady stream of about 70kbits/sec I couldn’t account for.

24kbps of that is my three DirecTV boxes sending UDP packets to the server. The packets go to a seemingly random high port that changes with each reboot: 34098 one time, 59521 another. There’s never a single response from the server, and it’s the only traffic I see from the DirecTV boxes to my Linux server. Each UDP packet has text in it like this:

HTTP/1.1 200 OK
Cache-Control: max-age=1800
EXT:
Location: http://192.168.3.63:49152/2/description.xml
Server: Linux/2.6.18.5, UPnP/1.0 DIRECTV JHUPnP/1.0
ST: uuid:29bbe0e1-1a6e-47f6-8f8d-dcd321ac5f80
USN: uuid:29bbe0e1-1a6e-47f6-8f8d-dcd321ac5f80

So it looks to be UPnP junk. Port 49152 is a bit of a tell; it’s the lowest-numbered dynamic/private port and is often used by UPnP servers to announce themselves. Sure enough that Location URL serves XML gunk advertising a DLNA server or something. Each DirecTV box sends a burst of about 6 of these packets every few seconds. All three of them do.
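For what it’s worth, an “HTTP/1.1 200 OK” over UDP like that is the shape of an SSDP search response: devices unicast it back to whatever ephemeral port an M-SEARCH discovery query came from. Here’s a little Python sketch that sends an M-SEARCH and prints the replies; I’d guess it provokes the same kind of packets from the DirecTV boxes.

# Send an SSDP M-SEARCH to the standard UPnP multicast group and print responses.
# Replies come back unicast to this socket's ephemeral port, the same pattern
# as the packets the DirecTV boxes are sending.
import socket

MSEARCH = (
    "M-SEARCH * HTTP/1.1\r\n"
    "HOST: 239.255.255.250:1900\r\n"
    'MAN: "ssdp:discover"\r\n'
    "MX: 2\r\n"
    "ST: ssdp:all\r\n"
    "\r\n"
).encode()

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(3)
sock.sendto(MSEARCH, ("239.255.255.250", 1900))
try:
    while True:
        data, addr = sock.recvfrom(4096)
        print(addr, data.decode(errors="replace").splitlines()[:4])
except socket.timeout:
    pass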

I wonder why my Linux box is so lucky as to get these? My Windows box doesn’t seem to get them. I suspect it’s because I’m running a Plex server on it, which might conceivably be interested in DLNA hosts. I turned off DLNA in Plex and rebooted and it’s still getting them.

Oh well, it’s not much bandwidth. Not sure where the rest of the 70kbps is going. There’s a lot of broadcast chatter on port 1900, more UPnP stuff. Nothing else focused like this.

autossh in Ubuntu 18

I’m setting up a new server and wanted to re-enable autossh, the magic “keep an ssh tunnel open at all costs” tool that’s useful for poking holes through firewalls. I ended up following my old notes pretty much exactly, down to using /etc/rc.local. That file doesn’t even exist in Ubuntu 18, but if you create one and make it executable, systemd will run it at boot. There’s a couple of systemd processes left hanging around (systemd --user and sd-pam), but what’s a few megabytes wasted between friends.

I did try to make this harder though. I spent some time trying to make this systemd service work, for instance. Never could get it to work. Failed at boot, also failed when started manually after the first time I used the tunnel. systemd is doing more magic than I understand.

It’s not clear you even need autossh if you’re using systemd. It has elaborate controls for restarting processes, checking a socket is live, etc. Here’s a sample of using systemd to keep an ssh tunnel open. I didn’t try it though. I trust autossh; it’s very clever about making sure the tunnel is really open and usable, using ssh-specific tricks systemd probably can’t do. Also I already had certificate management working, so screw it.
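For the curious, the core of what either approach buys you is roughly this supervision loop. This is just an illustrative sketch with made-up host and port values, not how autossh is actually implemented, and it skips autossh’s own echo-channel liveness checks.

# Keep an ssh tunnel up by restarting ssh whenever it exits.
# ServerAliveInterval makes ssh notice a dead connection and exit on its own,
# so the loop can bring the tunnel back; ExitOnForwardFailure makes a failed
# port forward fatal instead of leaving a useless connection.
import subprocess
import time

CMD = [
    "ssh", "-N",
    "-o", "ServerAliveInterval=30",
    "-o", "ServerAliveCountMax=3",
    "-o", "ExitOnForwardFailure=yes",
    "-R", "2222:localhost:22",        # example reverse tunnel; pick your own
    "tunnel@example.com",             # made-up remote host
]

while True:
    subprocess.run(CMD)               # blocks until ssh exits for any reason
    time.sleep(10)                    # brief backoff before reconnecting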

It’d be nice if the Ubuntu package for autossh included a working systemd unit file. Ubuntu switched to systemd, what, three years ago? But I guess not all the packages have caught up.