## Practical computer science

I read a blog post the other day that attempted to explain why computer science programs are “stagnating”. The content of the post was pretty unimpressive, but it got me thinking. There’s a quote from the founder of Make School, one of several companies that are bringing the “Learn X in 14 Days” programming books from the ’90s into the real world, that really sums up the attitude of the entire thing:

“Nothing I learned at UCLA helped me build my startup,” Desai says. “I had reached out to various CS and EE professors for help, and while they were enthusiastic about my work, they were unable to help me with the project.”

Basically, the complaint is that computer science programs don’t prepare students for what they expect to do after they graduate. If this is true, then it is a problem. While college isn’t supposed to be strictly job training, and, at least in a democracy, it is vitally important that education be much more than just job training, job training is still an important component. A computer science education should be useful.

I shared some of my thoughts on Reddit but I wanted to clean them up and elaborate a bit here.

There seem to be two groups of people who ask for more “applied” computer science programs. The first is made up of students who feel wronged by the field of computer science. They want more “useful” topics and less abstract theory, more code and less math, more frameworks and fewer broad principles, more “modern” topics and less historical background. The second group is companies that hire computer science graduates. They want employees who can walk in the door and be productive on day one, who already know the tools, languages, and frameworks that they use.

I have limited pity for the first group. Indeed, some schools may be better or worse than others, but as a whole computer science students already have the opportunity to learn about all sorts of applied topics outside of the classroom. Meetups, MOOCs, user groups, blogs, internships, open source, and professional conferences (which often offer free or cheap tickets to students) provide incredible opportunities for students to learn state-of-the-art, applied topics.

The ultimate problem I see is that all that abstract, theoretical knowledge students complain about is actually useful. Very few people will use all of it, but almost everyone will use a surprisingly large, and different, subset of it. Removing theory to make room for more application therefore won’t necessarily result in better, more capable programmers.

The second group that tends to call for more “applied” topics, companies, is, I think, rather short-sighted in its demands. Asking that employees, especially entry-level employees, come in with preexisting knowledge of specific tools, languages, applications, and frameworks is foolish, and asking that they learn all of these things in school is absurd. There are literally hundreds of tools, and no matter which ones a school might choose to teach, most students would still lack many specific applied skills a given employer might demand.

Students who lack a grounding in computer science theory are more likely to struggle when tasked with implementing more complex systems. Perhaps more importantly though, they are less likely to be able to adapt to changing needs within an organization. If you hire a Rails or Django developer, then that’s what you get. If you later need that person to be an Android developer, you may be out of luck. However, if you hire someone with a solid theoretical grounding, that person may be able to take on all of those roles and more.

Of course I’m not trying to say that everyone who gets a computer science degree will end up an amazing developer (and, conversely, some people who never study computer science do end up amazing developers). But graduates are more likely to get there if they have the tools they need to become amazing developers. It is also important to consider that there is a huge range of industries that employ computer science graduates, and each would ideally like people trained for its niche, but all are well-served by people with broad training.

To wrap up, computer science is not taught perfectly. The ideal balance between theory and practice surely changes from year to year, and maintaining a healthy debate is critical to help schools adapt. But I am always skeptical, and I think you should be as well, when I hear claims like the quote above that fall too far to one side or the other.

Image credit: Paul Keller

## Compulsive automation

Programmers tend to have a disease: we compulsively automate. That is, no matter the task, we are always on the lookout for ways to automate it regardless of how much (or little) we gain by doing so. The problem is that we too often end up with very small, or even negative, gains.

Automation can be viewed as a kind of optimization, and everyone knows that optimizing too early can cause problems. Certainly a task shouldn’t be automated unless it will need to be carried out repeatedly and doing so will be costly. However, compulsive automation seems to come in a few other varieties as well.

The first is when so much time is spent on automation that it kills, or disproportionately hinders, the overall project. In this case, there might be very good reasons for automating, but the resources to actually carry it out may not exist.

This can happen at the very beginning of a project. Prematurely setting up continuous integration, version control, and a reproducible development environment can, in some cases, prevent a project from getting off the ground. Automation at the “end” of a project can also lead to problems. I personally struggle with this more than any of the others. Deploying an application is a great example.

You’ve got your snazzy new app (or whatever) working and you’re ready to show it to the world. You could set up a snowflake server, but everyone knows that’s a bad idea. So you decide to automate. You then proceed to fiddle around with Chef or Ansible until you run out of steam and never actually deploy anything, or you deploy but never actually make any updates (which would have justified the automation effort).

In the long run, automating deployments is the right thing to do. But when you’re deploying a prototype or a side project the extra time required up-front can hurt your momentum. It doesn’t matter how much theoretical time you’ll save in the future if no one ever sees your work.

A second variety of unwise automation is when automation reduces the burden on the person doing the automating but transfers it to others, sometimes even magnifying it in the process. The implementation of information systems tends to be an ugly business. We often forget that many of the ugliest systems actually seem clean and elegant to their users. Sometimes the price of this elegance is manual effort behind the scenes. This effort can often be eliminated, but doing so usually requires either significant technical investment or the imposition of constraints on end-users. I noticed a great example of this phenomenon on Hacker News the other day (which actually inspired this blog post).

It was revealed that the volunteer who has been (manually) aggregating hiring-related posts for the past four years has decided to step down. Shortly thereafter, a specification for hiring posts was proposed. The spec itself isn’t bad; it tries to split the difference between human- and machine-readability and does a decent job of it. However, it would require anyone who wanted to post a job to read, understand, and follow the spec.

This wouldn’t be a big deal if the same people posted jobs over and over again, but the community discourages posts from recruiters and HR employees. This means that most people who post will only post occasionally, increasing the odds of having to re-learn the spec every single time.

It seems reasonable, given that someone was willing to do the job manually for four years, to assume that the amount of effort involved in aggregating job posts is manageable. So a spec would save a relatively small amount of time behind the scenes, but at a large (total) cost on the part of the posters.

To be fair, a spec for hiring posts might make them easier to search, but a couple of bullet points with suggestions for how to write an effective job post would solve this problem just as well.

The final problematic form of automation is when the automation itself becomes a larger project than the original task. I think this usually happens because we delude ourselves into believing that the automation project will be “easy”. Even when the automation is fairly straightforward, feature creep can turn a 10-line shell script into a 10,000-line application before anyone even realizes what is happening.

However, this kind of automation isn’t always a bad idea. If the automation tool can be released for use by others, the total time saved across all users may be greater than the time it took to build the solution. We see this dynamic a lot with open source software. Of course the time must still be justified internally, perhaps trading time for goodwill from the community.

So what is to be done? Certainly we shouldn’t stop automating; the benefits are just too great. What we should do is always consider the context in which an automation project exists. We should think explicitly about the benefits of automating and when they will be realized, whether automating will actually put an additional burden on users, and whether the realistic cost of automating is actually worthwhile.

Image credit: XKCD: Automation

## Units in Julia

I’ve been excited about Julia since before their first release (back then you had to build it from source). Lately I’ve been working on CGP.jl so I’ve been able to really immerse myself in the language and ecosystem.

I have always found the idea of associating physical units with values in a programming language to be interesting and potentially useful. We have languages with very powerful type systems, but we don’t typically have elegant ways of saying “this is a floating point value and it represents a number of centimeters”. These exist, of course, and implementing this kind of thing in an object oriented language is probably a pretty useful learning exercise, but again, it’s really about elegance and simplicity.

Julia has three characteristics that make something like this relatively elegant. First, it has optional static types and multiple dispatch. This should let us write a function that operates on “inches” but not on “centimeters” or “kilograms”. Second, its operators are ordinary functions (and there is some nice syntactic sugar related to multiplication). This lets us override basic operations like addition. When combined with parametric types, we can even use a single implementation to handle multiple units. Finally, Julia supports rich Lisp-style (more or less) macros, meaning we can easily define a whole bunch of unit types and associated functions with a relatively small amount of code.

It is, perhaps, worth considering how we would do something like this in a functional language that supports pattern matching (since, to me at least, that would be the other obviously “elegant” way of solving the problem). Here’s a stupidly simple example in Elixir (which, by the way, also supports macros).

https://gist.github.com/glesica/51e2fa379e9c0105c302

Operator overloading is possible in Elixir, but the point is really that we can very cleanly implement a function that takes only particular kinds of units by agreeing to pass around tuples of a certain kind. We don’t have this sort of pattern matching in Julia, but we can match on types (multiple dispatch).

So here’s the code in Julia:

Let’s look at it a piece at a time. We’ve got three macros. Let’s take them in order. The first, `defunit`, allows us to add a new unit to the units graph we will create. It requires a name and a parent (or “kind”). The name serves an obvious purpose; the parent is less clear. This macro also defines a shortcut function that lets us convert another unit of the same kind (sharing a parent) to this one. So we can do things like `In(x)` where `x` is a `Cm` value.

Each kind of unit has a base unit. This is the unit through which all conversions that aren’t explicitly defined will be done. We can define a base unit for a kind using the `defbase` macro. For example, in the snippet we define centimeters to be the linear base unit. This means that to define an inches unit we need only provide a conversion to centimeters to be able to convert between inches and any other linear unit (such as feet, in the example). This doesn’t work perfectly since we might end up with significant floating point error or overflow, but it works well enough.
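The base-unit idea doesn't depend on Julia's macros, so here is a rough sketch of it in Go. This is not the Julia implementation described in this post. The `toBase` table and the `convert` function are hypothetical names invented for illustration, and units are plain strings rather than types. Every conversion goes through centimeters, so each unit only needs one factor.

```go
package main

import "fmt"

// toBase holds, for each linear unit, how many centimeters (the base unit)
// make up one of that unit. Hypothetical table, for illustration only.
var toBase = map[string]float64{
	"cm": 1,
	"in": 2.54,
	"ft": 30.48,
}

// convert routes every conversion through the base unit:
// from -> centimeters -> to.
func convert(value float64, from, to string) (float64, error) {
	f, ok := toBase[from]
	if !ok {
		return 0, fmt.Errorf("unknown unit %q", from)
	}
	t, ok := toBase[to]
	if !ok {
		return 0, fmt.Errorf("unknown unit %q", to)
	}
	return value * f / t, nil
}

func main() {
	inches, err := convert(1, "ft", "in")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("1 ft is about %.2f in\n", inches)
}
```

The multiply-then-divide step is also where the floating point error mentioned above creeps in, which is why a direct conversion between two units can be more accurate.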

Last of all, we can define a conversion using the `defconv` macro. We must define a conversion from a unit into the base unit, but we can also define other conversions if we’d like better accuracy.

Next, let’s take a look at this line: `*{T <: Unit}(x::Real, ::Type{T}) = T(x)`. Here we have abused the multiplication operator and made it into a pseudo constructor. Why? Because Julia has some fancy syntactic sugar that lets us write `2x` instead of `2 * x`. This was added, presumably, to aid in translating mathematical formulas to code. For our purposes, it allows us to write something like `2Cm` and have it mean exactly what it looks like it should mean. We need to be careful about operator precedence of course.

Finally, we define a bunch of arithmetic functions / operators. Since all units have more or less the same form, what we're doing here is enforcing the rule that you can only combine a value of a particular unit with another value of the same unit. This means, then, that something like `2Cm + 4Ft` is an error (since who knows what the resulting unit should be?!). To make this work we would need to explicitly acknowledge the units by converting one of them like so: `2Cm + Cm(4Ft)`.

I generally like this solution to the problem. Julia provides all the tools necessary to solve this problem in a fairly elegant manner.

Image credit: Scott Akerman

## Dynamic Pools in Go

Recently, I wanted to make use of the pool pattern, which is generally pretty simple in Go. Specifically, however, I wanted to be able to dynamically cap the level of concurrency for any given set of tasks submitted to the pool.

To clarify, let’s say we have a pool that consists of N workers. For any given job A, consisting of tasks a₁, a₂, …, aₙ, we want no more than k of the n tasks in A to run concurrently, where k ≤ n and k ≤ N. My use case is a system to test HTTP resources. Each job might be a specific set of endpoints. I might want to hit some more “gently” than others, hence the need to dynamically cap the level of concurrency.

Having worked with the Erlang ecosystem a bit, I really like the idea of passing messages between independent “processes”. This is a very natural and fairly simple abstraction.

Go is a little different, though. In Erlang you create a process and then pass around its PID, which can be used to send it messages (like having its address). In Go, a goroutine (which we can think of as a process) is decoupled and independent: spawning one gives you back no PID or other handle to address it by (although shared mutable state is still possible).

In order to communicate between goroutines, Go makes use of channels, which are like pipes or queues and can be one- or two-way. This means that if you want to spawn a goroutine and then pass it messages, you need to give it a reference to the channel you plan to use.

I’ve included a very simple example below (note that there is a race condition in this code, but it doesn’t matter because the point is to illustrate how channels work). In this case I have made the channel accessible to the goroutine using a closure, but I could have also passed it into the function.

https://gist.github.com/glesica/96c398c83f2648c6eed9
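For reference, here is a minimal sketch of the same idea. It is not the gist's code, and it avoids the race condition by waiting on a second channel. A goroutine captures two channels through a closure, reads messages until the first is closed, then reports back on the second. The names `sumViaChannel`, `messages`, and `result` are made up for illustration.

```go
package main

import "fmt"

// sumViaChannel spawns a goroutine that reads integers from a channel and
// reports their sum once that channel is closed.
func sumViaChannel(nums []int) int {
	messages := make(chan int) // unbuffered: each send waits for a receive
	result := make(chan int)

	// The goroutine reaches both channels through the closure.
	go func() {
		total := 0
		for n := range messages { // the loop ends when messages is closed
			total += n
		}
		result <- total
	}()

	for _, n := range nums {
		messages <- n
	}
	close(messages)

	// Blocking here until the goroutine answers is what removes the race.
	return <-result
}

func main() {
	fmt.Println(sumViaChannel([]int{1, 2, 3, 4}))
}
```

Passing the channels as function arguments instead of capturing them would work just as well. The closure is only shorter.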

This basic pattern can be used to construct a goroutine pool. We can spawn several goroutines that listen on a channel until they receive a task, complete it, send the result back through another channel, then start listening again. They’ll stop listening when the channel is “closed”. We can use a `sync.WaitGroup`, which is loosely similar to a semaphore, to make sure we don’t move on until all the workers are finished.

https://gist.github.com/glesica/7db5e0308589dbfe149a
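A minimal sketch of such a pool might look like the following. It is an illustration rather than the gist's code: the `runPool` name and the "square each integer" task are assumptions chosen to keep the example small.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool starts `workers` goroutines that square each task, and returns the
// results in whatever order the workers finish them.
func runPool(workers int, tasks []int) []int {
	taskCh := make(chan int)
	// Buffered so that workers never block while sending results.
	resultCh := make(chan int, len(tasks))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range taskCh { // stops once taskCh is closed
				resultCh <- t * t
			}
		}()
	}

	for _, t := range tasks {
		taskCh <- t
	}
	close(taskCh) // tell the workers there is nothing left to do
	wg.Wait()     // don't move on until every worker has finished
	close(resultCh)

	results := make([]int, 0, len(tasks))
	for r := range resultCh {
		results = append(results, r)
	}
	return results
}

func main() {
	fmt.Println(runPool(3, []int{1, 2, 3, 4, 5}))
}
```

With more than one worker, the order of the results is not guaranteed to match the order of the tasks.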

This is great except for the fact that, given a set of tasks as described above, we might execute up to N of them concurrently depending on the workload of our pool. We need a way to group tasks together into what I called “jobs” above.

One solution (there may be others) is to take advantage of the fact that channels in Go are themselves just values, so they can be passed through other channels. Instead of workers that pull jobs from a shared queue, they can pull queues (channels) from a shared queue (channel).

https://gist.github.com/glesica/2c4ddcc5c9e71cba442b

Note that now our jobs channel is a channel of channels of integers. So before we submit the tasks associated with a particular job, we decide how many of the workers may work on these tasks concurrently and we submit the task channel that many times. Then we feed the tasks into the task channel and, at most, that many workers receive our tasks.
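To make the mechanism concrete, here is a sketch under a few assumptions. Tasks are plain `func()` values rather than integers, and `startPool`, `submit`, and `peakConcurrency` are hypothetical names, not the gist's. The last function exists only to observe how many tasks ever run at once.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// startPool launches n workers. Each worker pulls a task channel from jobs,
// drains it, then goes back for the next one.
func startPool(n int, jobs <-chan chan func(), wg *sync.WaitGroup) {
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for tasks := range jobs {
				for task := range tasks {
					task()
				}
			}
		}()
	}
}

// submit offers one job's task channel to the pool k times, so at most k
// workers can be draining it at once, then feeds in the tasks.
func submit(jobs chan<- chan func(), k int, work []func()) {
	tasks := make(chan func())
	for i := 0; i < k; i++ {
		jobs <- tasks
	}
	for _, w := range work {
		tasks <- w
	}
	close(tasks)
}

// peakConcurrency runs one job on a pool of n workers with cap k and
// returns the largest number of tasks seen running at the same time.
func peakConcurrency(n, k, numTasks int) int {
	jobs := make(chan chan func(), n) // buffered: assumes k <= n
	var wg sync.WaitGroup
	startPool(n, jobs, &wg)

	var running, peak int32
	work := make([]func(), numTasks)
	for i := range work {
		work[i] = func() {
			cur := atomic.AddInt32(&running, 1)
			for { // record a new peak if this is the highest count so far
				old := atomic.LoadInt32(&peak)
				if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
					break
				}
			}
			time.Sleep(time.Millisecond) // stand-in for real work
			atomic.AddInt32(&running, -1)
		}
	}

	submit(jobs, k, work)
	close(jobs)
	wg.Wait()
	return int(atomic.LoadInt32(&peak))
}

func main() {
	fmt.Println("peak concurrency with k=2:", peakConcurrency(4, 2, 20))
}
```

Because the task channel is placed on the jobs channel exactly k times, no more than k workers can ever hold it, regardless of how large the pool is.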

A couple of caveats are in order. First, it is perfectly possible that fewer than the maximum number of workers will process the tasks if the rest are busy. In this case a worker will grab a task channel that has already been closed and immediately discard it. For this reason, this strategy might not be the best for long-running pools (eventually you could end up with a lot of closed channels in your queue; maybe that causes a problem for you, maybe it doesn’t).

Another thing to note is that each job (group of tasks) now requires its own channel. This might not be great for situations where each job is quite small and there are many jobs.

In any event, you can play around with the code and see for yourself that it works. Change the “1” on line 26 to a “3” and you should notice that the results come back mixed up instead of in order.

Image credit: Thomas Hawk

## Emacs.

The other day I opened up Vim and a bunch of formatting was messed up and things weren’t refreshing properly. Some update had probably broken something. Then I realized that my Vim config was a massive mess (you never realize stuff like that until something breaks).

I’d intended to switch to Emacs eventually. It had been kind of an elaborate dance, but I had always suspected I would end up there. I really like the idea of Lisp and I think using Emacs is actually one of the better ways to get comfortable with it, plus it’s a decent editor, or so I hear.

So now, perhaps sooner than expected, I am an Emacs user (since a couple weeks ago). I’ve got several friends helping me out and providing suggestions, and I’ve already got quite a bit of useful stuff set up. My configuration is on GitHub, because why not?

So far I am quite pleased, but wow, this is going to be a long, interesting journey.

Image credit: XKCD: Real Programmers