Privilege for computer scientists

As I was walking home from work a few nights ago I was thinking about privilege. I was also thinking about the AI Google built that beat a human champion at Go. Most algorithms that play Go use some form of the Monte Carlo tree search (MCTS) algorithm, and the Google algorithm is no exception, though MCTS is only a relatively small part of it.

I know a little bit about MCTS, having read some of the papers on it and implemented it in school (my AI played Connect Four). MCTS is generally most applicable on game trees with a finite depth. In other words, the games must definitely end at some point. This is not true of games like chess or checkers where, in theory (setting aside draw rules such as repetition and move limits), the game could go on forever if the players repeatedly make neutral moves (like moving two pieces back and forth forever).

The reason for this is that MCTS works by choosing moves at random for both players until the game ends, then recording who won. This process is repeated many times (usually until a computational or time budget is exhausted), at which point the move with the best simulated results is chosen. Obviously, the random games must be guaranteed to come to an end or the algorithm wouldn’t work very well.

MCTS operates based on the density of winning outcomes on a particular branch of the game tree. If move A at a particular point in the game results in a win (when random moves are chosen) 70% of the time, and move B results in a win 45% of the time, the algorithm will choose branch A (in reality it is a little more complicated than this, but the idea is the same).
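To make the idea concrete, here is a minimal sketch of the "random games" step described above. It is only the flat Monte Carlo core, not full MCTS (there is no tree and no clever selection policy), and the toy game is my own assumption for illustration: players alternately remove 1, 2, or 3 stones from a pile, and whoever takes the last stone wins. The names legal_moves, random_playout, and choose_move are hypothetical.

```python
import random


def legal_moves(stones):
    """Moves in the toy take-away game: remove 1, 2, or 3 stones."""
    return [n for n in (1, 2, 3) if n <= stones]


def random_playout(stones, player_to_move, rng):
    """Play uniformly random moves until the pile is empty; return the winner.

    The player who takes the last stone wins. The pile shrinks on every
    turn, so the playout is guaranteed to end (finite game tree).
    """
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return player_to_move
        player_to_move = 1 - player_to_move


def choose_move(stones, player, playouts_per_move=2000, seed=0):
    """Estimate each move's win rate from random playouts and pick the best."""
    rng = random.Random(seed)
    win_rates = {}
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            # Taking the last stone wins immediately; no simulation needed.
            win_rates[move] = 1.0
            continue
        wins = sum(
            random_playout(remaining, 1 - player, rng) == player
            for _ in range(playouts_per_move)
        )
        win_rates[move] = wins / playouts_per_move
    best = max(win_rates, key=win_rates.get)
    return best, win_rates
```

Calling choose_move(10, 0) returns the move with the highest estimated win rate along with the per-move estimates, which are exactly the "densities" of winning outcomes on each branch.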

This is pretty much how “privilege” works in real life. At any given node in the tree representing all the decisions each of us makes in our lives, there is some probability that a particular choice will lead to a good outcome. In other words, each branch has a particular density of good outcomes.

Privilege, then, is when a person has a higher density of good outcomes on all of his or her branches.

For example, I read recently that a child with wealthy parents who does not attend college is more than twice as likely to end up wealthy as a child with poor parents who does attend college. So when young people decide whether or not to attend college, those with rich parents face better probable outcomes across the whole range of choices. That, in a nutshell, is privilege.

Grouping Tabs

This evening I wrote and uploaded my first Chrome extension (well, technically I wrote one a few years ago, but I never really finished it).

What does it do? It lets you group related tabs together and keeps them grouped together.

Why would anyone want to do this? At any given moment while I’m at work I’m monitoring at least two or three pull requests. I try to keep their corresponding tabs grouped together for easy access, but inevitably they become lost among the 20-30 other tabs I have open.

I could pin them, but that changes the semantics of the tabs themselves and hides the title (even when the title would otherwise be visible). So I have my email and calendar pinned, because I never close those. But I wanted an intermediate state for things like pull requests. Enter “Pseudo Pins”.

Pseudo Pins allows the user to specify one or more regular expressions, which are then matched against the URLs of the tabs in each window. Tabs matching a given expression are pulled to the left and grouped together. The leftmost tabs then correspond to the first regular expression in the list, and so on rightward. The list of expressions is persisted across browser sessions (and will sync across devices if Chrome is set up to do so).
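The ordering rule can be sketched in a few lines. This is not the extension's code (that runs against Chrome's tab API); it is a Python illustration of the grouping logic only, and it assumes that a tab matching several expressions goes with the first one in the list. The function name order_tabs and the URLs in the usage example are hypothetical.

```python
import re


def order_tabs(tab_urls, patterns):
    """Return tab URLs reordered so that matching tabs are grouped on the left.

    Tabs matching patterns[0] come first, then patterns[1], and so on.
    Tabs matching nothing keep their relative order on the right.
    """
    compiled = [re.compile(p) for p in patterns]

    def group_index(url):
        # Assumption: the first matching expression decides the group.
        for i, rx in enumerate(compiled):
            if rx.search(url):
                return i
        return len(compiled)  # unmatched tabs sort after every group

    # sorted() is stable, so tabs within a group keep their original order.
    return sorted(tab_urls, key=group_index)


tabs = [
    "https://news.example.com/",
    "https://git.example.com/org/repo/pull/12",
    "https://mail.example.com/",
    "https://git.example.com/org/repo/pull/7",
    "https://docs.example.com/api",
]
print(order_tabs(tabs, [r"/pull/\d+", r"^https://docs\."]))
```

A stable sort keyed on the group index is enough here, because the only requirement is that groups move left in list order while everything else stays put.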

The GitHub repo is here if you are interested:

Reproducible Research

According to this article, two economists attempted to reproduce a number of economics papers that were published in top journals. They were unable to do so in most cases, even when they enlisted the original authors to help.

This result didn’t shock me even a little bit. When I wrote my MA thesis in economics I wanted to employ a particular, and rather obscure, statistical technique. I couldn’t find a single book on statistics or econometrics that contained a full description of the technique; it was apparently rather specific to the sub-sub-field in which I was working.

Over a dozen papers claimed to have used it or otherwise discussed it, but zero contained an actual description of what was done to the data to bring about the result.

I finally found a proper description in a master’s thesis from someone at a university in Sweden (if I recall correctly) whose adviser had apparently just happened to know what was being done to the data and who had actually taken the time to describe it. The thesis was never peer-reviewed (although the student had apparently graduated successfully, so I felt comfortable relying on it). So while I still had to implement it myself and verify my results, at least I knew where to start.

The situation is even worse given that it is fairly rare (in my admittedly limited experience) for authors in economics (or other social sciences) to publish their original datasets and (perhaps even more importantly) the code that they ran to do their analyses. I suspect that many couldn’t even if they wanted to due to a reliance on tools like Excel and SPSS that do not lend themselves to replicability without significant extra effort.

This is not to say that economists are evil or that there is some kind of conspiracy (although some are evil, and there are almost certainly “conspiracies”; the replicability problem just isn’t evidence of either).

Part of the problem, I think, is that many, or even most, economists never learn about tools they could use to do a better job at promoting replicability. Version control (Git, Subversion) and tools like GitHub or self-hosted alternatives (why don’t universities run these for their faculty?) are a great start. Using proper statistical languages and doing 100% of analysis using code, not “interactive mode”, would help as well.

However, the real key, in my opinion, is for people to get comfortable working out in the open. I publicly publish virtually every line of non-trivial code I write. A lot of it is complete garbage, but I publish it because there is simply no reason not to do so. I’m writing my computer science thesis entirely in the open, from the very first paragraph.

I do realize, of course, that the stakes for me are very, very low. Academia is not my career, so I don’t worry about getting “scooped”, or about being attacked by a colleague with a vendetta. But that just means that some people might want to employ a self-imposed embargo before releasing their work. Wait until your grant runs out, or until the paper is actually published, and then put everything online (and no, simply dumping a PDF on arXiv doesn’t count; do it right). I honestly believe that every field would be better off if this were the norm rather than the happy exception.

As an aside, for anyone interested, Roger Peng, a biostatistician at Johns Hopkins, has an excellent Coursera course on reproducible research. I watched some of the lectures and it seemed like a great course on an important topic. As a bonus, Dr. Peng is a fantastic lecturer, and his courses on the R programming language are also top notch (and accessible enough for reasonably bright social science students).

Image credit: Janneke Staaks

Practical computer science

I read a blog post the other day that attempted to explain why computer science programs are “stagnating”. The content of the post was pretty unimpressive, but it got me thinking. There’s a quote from the founder of Make School, one of several companies that are bringing the “Learn X in 14 Days” programming books from the ’90s into the real world, that really sums up the attitude of the entire thing:

“Nothing I learned at UCLA helped me build my startup,” Desai says. “I had reached out to various CS and EE professors for help, and while they were enthusiastic about my work, they were unable to help me with the project.”

Basically, the complaint is that computer science programs don’t prepare students for what they expect to do after they graduate. If this is true, then it is a problem. While college isn’t supposed to be strictly job training, and, at least in a democracy, it is vitally important that education be much more than just job training, job training is still an important component. A computer science education should be useful.

I shared some of my thoughts on Reddit but I wanted to clean them up and elaborate a bit here.

There seem to be two groups of people who ask for more “applied” computer science programs. The first is made up of students who feel wronged by the field of computer science. They want more “useful” topics and less abstract theory, more code and less math, more frameworks and fewer broad principles, more “modern” topics and less historical background. The second group is companies that hire computer science graduates. They want employees who can walk in the door and be productive on day one, who already know the tools, languages, and frameworks that they use.

I have limited pity for the first group. Indeed, some schools may be better or worse than others, but as a whole computer science students already have the opportunity to learn about all sorts of applied topics outside of the classroom. Meetups, MOOCs, user groups, blogs, internships, open source, and professional conferences (which often offer free or cheap tickets to students) provide incredible opportunities for students to learn state-of-the-art, applied topics.

The ultimate problem I see is that all that abstract, theoretical knowledge students complain about is actually useful. Very few people will use all of it, but almost everyone will use a surprisingly large, and different, subset of it. Removing theory to make room for more application therefore won’t necessarily result in better, more capable programmers.

The second group who tend to call for more “applied” topics, companies, are, I think, rather short-sighted in their demands. Asking that employees, especially entry level employees, come in with preexisting knowledge of specific tools, languages, applications, and frameworks is foolish, and asking that they learn all of these things in school is absurd. There are literally hundreds of tools, and no matter which ones a school might choose to teach, most students would still lack many specific applied skills a given employer might demand.

Students who lack a grounding in computer science theory are more likely to struggle when tasked with implementing more complex systems. Perhaps more importantly though, they are less likely to be able to adapt to changing needs within an organization. If you hire a Rails or Django developer, then that’s what you get. If you later need that person to be an Android developer, you may be out of luck. However, if you hire someone with a solid theoretical grounding, that person may be able to take on all of those roles and more.

Of course I’m not trying to say that everyone who gets a computer science degree will end up an amazing developer (and, conversely, some people who never study computer science will end up amazing developers). But it is more likely that they will if they have the tools they need to become an amazing developer. It is also important to consider that there is a huge range of industries that employ computer science graduates, and each would ideally like people trained for their niche, but all are well-served by people with broad training.

To wrap up, computer science is not taught perfectly. The ideal balance between theory and practice surely changes from year to year, and maintaining a healthy debate is critical to help schools adapt. But I am always skeptical, and I think you should be as well, when I hear claims like the quote above that fall too far to one side or the other.

Image credit: Paul Keller

Compulsive automation

Programmers tend to have a disease: we compulsively automate. That is, no matter the task, we are always on the lookout for ways to automate it regardless of how much (or little) we gain by doing so. The problem is that we too often end up with very small, or even negative, gains.

Automation can be viewed as a kind of optimization, and everyone knows that optimizing too early can cause problems. Certainly a task shouldn’t be automated unless it will need to be carried out repeatedly and doing so will be costly. However, compulsive automation seems to come in a few other varieties as well.

The first is when so much time is spent on automation that it kills, or disproportionately hinders, the overall project. In this case, there might be very good reasons for automating, but the resources to actually carry it out may not exist.

This can happen at the very beginning of a project. Prematurely setting up continuous integration, version control, and a reproducible development environment can, in some cases, prevent a project from getting off the ground. Automation at the “end” of a project can also lead to problems. I personally struggle with this more than any of the others. Deploying an application is a great example.

You’ve got your snazzy new app (or whatever) working and you’re ready to show it to the world. You could set up a snowflake server, but everyone knows that’s a bad idea. So you decide to automate. You then proceed to fiddle around with Chef or Ansible until you run out of steam and never actually deploy anything, or you deploy but never actually make any updates (which would have justified the automation effort).

In the long run, automating deployments is the right thing to do. But when you’re deploying a prototype or a side project the extra time required up-front can hurt your momentum. It doesn’t matter how much theoretical time you’ll save in the future if no one ever sees your work.

A second variety of unwise automation is when automation reduces the burden on the person doing the automating but transfers it to others, sometimes even magnifying it in the process. The implementation of information systems tends to be an ugly business. We often forget that many of the ugliest systems actually seem clean and elegant to their users. Sometimes the price of this elegance is manual effort behind the scenes. This effort can often be eliminated, but doing so usually requires either significant technical investment or the imposition of constraints on end-users. I noticed a great example of this phenomenon on Hacker News the other day (which actually inspired this blog post).

It was revealed that the volunteer who has been (manually) aggregating hiring-related posts for the past four years has decided to step down. Shortly thereafter, a specification for hiring posts was proposed. The spec itself isn’t bad; it tries to split the difference between human- and machine-readability and does a decent job of it. However, it would require anyone who wanted to post a job to read, understand, and follow the spec.

This wouldn’t be a big deal if the same people posted jobs over and over again, but the community discourages posts from recruiters and HR employees. This means that most people who post will only post occasionally, increasing the odds of having to re-learn the spec every single time.

It seems reasonable, given that someone was willing to do the job manually for four years, to assume that the amount of effort involved in aggregating job posts is manageable. So a spec would save a relatively small amount of time behind the scenes, but at a large (total) cost on the part of the posters.

To be fair, a spec for hiring posts might make them easier to search, but a couple of bullet points with suggestions for how to write an effective job post would solve this problem just as well.

The final problematic form of automation is when the automation itself becomes a larger project than the original task. I think this usually happens because we delude ourselves into believing that the automation project will be “easy”. Even when the automation is fairly straightforward, feature creep can turn a 10-line shell script into a 10,000-line application before anyone even realizes what is happening.

However, this kind of automation isn’t always a bad idea. If the automation tool can be released for use by others, the total time saved across all users may be greater than the time it took to build the solution. We see this dynamic a lot with open source software. Of course the time must still be justified internally, perhaps trading time for goodwill from the community.

So what is to be done? Certainly we shouldn’t stop automating; the benefits are just too great. What we should do is always consider the context in which an automation project exists. We should think explicitly about the benefits of automating and when they will be realized, whether automating will actually put an additional burden on users, and whether the realistic cost of automating is actually worthwhile.
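One way to make that thinking explicit is a back-of-the-envelope break-even calculation. The sketch below is only an illustration: the function names and every number in the usage lines are hypothetical, and it assumes all costs can be expressed in hours. The burden_per_run parameter stands for extra effort pushed onto other people each time the task runs (the second variety above).

```python
import math


def net_hours_saved(build_hours, saved_per_run, runs, burden_per_run=0.0):
    """Total hours gained (negative means lost) after a given number of runs."""
    return runs * (saved_per_run - burden_per_run) - build_hours


def break_even_runs(build_hours, saved_per_run, burden_per_run=0.0):
    """Smallest number of runs at which the automation has paid for itself.

    Returns None when each run costs others at least as much as it saves,
    because then the automation never pays off.
    """
    gain_per_run = saved_per_run - burden_per_run
    if gain_per_run <= 0:
        return None
    return math.ceil(build_hours / gain_per_run)


# Hypothetical numbers: 20 hours to build, half an hour saved per run.
print(break_even_runs(20, 0.5))        # runs needed before it pays off
print(net_hours_saved(20, 0.5, 10))    # still in the red after 10 runs
print(break_even_runs(20, 0.5, 0.75))  # burden exceeds savings: never
```

If the break-even point is further away than the project is likely to live, or the function returns None, that is a strong hint to do the task by hand for now.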

Image credit: XKCD: Automation