Another day, another networking event -- I'm slowly getting used to going to all these Boston Tech Meetups and such, to meet people, talk up Querki and start to understand how one gets an investment.
Along the way, I'm chatting with lots of folks, and a remarkably large fraction lead off with, "Well, I've always been doing X, but I want to learn to code". (Last night's was a fellow who does financial compliance work for one of the large funds.) These folks are usually self-taught, and tend to be very self-deprecating about the fact that they didn't go to school so they don't *really* understand programming. A couple of the programmers I was with and I got chatting about that, and the fact that, yes, the best way to learn to program is by doing. A degree in CS is helpful, but mostly in that it teaches you some of the underlying theory for programming *well*; the nuts and bolts change so often that the details you learn in school will only be useful for a limited time anyway. Somewhere in there, I asserted that you could probably list all of the most-useful bits of theory and practice in one brief talk anyway.
So, here's a challenge: help me figure out what those are. What are the key engineering principles that *every* programmer should know, that probably aren't obvious to a newbie and which aren't necessarily going to be taught in an online "How to Java" class?
I'll start out with a few offhand:
Refactoring: great code doesn't usually come from a Beautiful Crystalline Vision that some programmer dreams up -- it comes from writing some code, getting it working, and then rearranging it to make the code *better* while it's still working. That's "refactoring": the art of making the code cleaner without changing what it's doing. It's a good habit to get into, especially because it takes practice. (Granted, listing all the major refactoring techniques is a good-sized talk itself; I highly recommend Fowler's book on the subject.)
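To make that concrete, here's a hypothetical "extract function" refactoring sketched in Python -- the function names and the discount rule are invented for illustration:

```python
# Before: the discount rule is buried inline, so its intent is easy to miss.
def order_total_before(prices, is_member):
    total = sum(prices)
    if is_member and total > 100:
        total = total * 0.9
    return total

# After: the same behavior, but the rule now has a name that states the intent.
def member_discount(total, is_member):
    """Members get 10% off orders over 100 (hypothetical business rule)."""
    if is_member and total > 100:
        return total * 0.9
    return total

def order_total_after(prices, is_member):
    return member_discount(sum(prices), is_member)
```

The point is that both versions compute exactly the same answers; only the shape of the code changed, and the named rule is now easy to read, test, and reuse.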
The DRY (Don't Repeat Yourself) Principle: which I usually describe as "Duplication is the source of all evil". Any time you are duplicating code, you're making it much more likely that you'll get bugs when things change. Much of refactoring is about merging things to eliminate duplication. Similarly, duplicate data is prone to getting out of sync and causing problems, so you should usually try to point to the same data when it's convenient to do so.
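A minimal sketch of what that looks like in practice, in Python, with an invented 7% tax rule standing in for the duplicated logic:

```python
# Duplicated: the tax rule is written out twice. When the rate changes,
# one copy inevitably gets missed, and the two totals silently disagree.
def invoice_total_duplicated(items):
    return sum(items) * 1.07

def shipping_quote_duplicated(base, weight):
    return (base + weight * 2) * 1.07

# DRY: one definition of the rule, and both callers point at it.
TAX_RATE = 0.07

def with_tax(amount):
    return amount * (1 + TAX_RATE)

def invoice_total(items):
    return with_tax(sum(items))

def shipping_quote(base, weight):
    return with_tax(base + weight * 2)
```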
Efficiency is good, but algorithmic complexity is what matters: this is what's often called "Big-O" notation in computer science. How fast things run *does* matter, but only in the grand scheme of things. Whether this approach takes twice as long as that one probably doesn't matter unless you're doing it a bazillion times per second. What *does* tend to matter, given a list of size n, is whether you're going through it just once -- O(n) in the notation -- or whether each time through you're going through the whole list again -- O(n^2) in the notation, that is, "n-squared". (You'd be surprised how easy it is to wind up with algorithms that are n^2 or even n^3 -- those can actually get slow.) Or, if you have two lists of sizes m and n, does your approach take O(n+m) time, or O(n*m)? It's worth practicing thinking through these order-of-magnitude evaluations and getting an intuition for it. That said...
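For instance, a Python sketch with invented names: finding the items two lists have in common is O(n*m) if you scan one list for every element of the other, but O(n+m) if you build a set first:

```python
# O(n*m): for each of the n items in xs, 'x in ys' scans the whole list ys.
def common_items_quadratic(xs, ys):
    return [x for x in xs if x in ys]

# O(n+m): build a set once (O(m)), then each membership check is O(1).
def common_items_linear(xs, ys):
    ys_set = set(ys)
    return [x for x in xs if x in ys_set]
```

Both give the same answer; on ten items you'll never notice the difference, but on a hundred thousand the first version can take minutes while the second takes a fraction of a second.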
Big stuff swamps small stuff: in one community the other day, I pointed out an approach to solving a problem that involved creating an extra object for each HTTP call. One of the folks in the discussion asked whether that inefficiency would matter, and I had to point out that you're already handling an HTTP call -- at *best*, the overhead of that handler is at least 1000 times that extra object creation, quite likely 10000 times more, so this is a drop in the bucket. So keep scale in mind, and don't sweat the small stuff. If you know your list is never going to have more than ten entries, even O(n^3) probably doesn't matter much.
What else? Can we craft a reasonably brief Rosetta Stone that summarizes the *common* stuff that every programmer should know, so they know what to look for? What are the principles that are true regardless of programming language, which aren't necessarily taught by the average JavaScript bootcamp? DRY is the heart and soul of good programming IMO -- are there other principles of similar importance?
(no subject)
Date: 2016-05-11 12:38 pm (UTC)
Every program is an attempt to capture and automate a decision-making process. If you don't fully understand the specific process you are working on, nothing else matters.
(no subject)
Date: 2016-05-11 12:45 pm (UTC)
(no subject)
Date: 2016-05-11 01:14 pm (UTC)
Not refactoring. But design and then code in such a way that you are humble about your assumptions. Don't implement anything "in case I might need it", but implement so that your base ideas can expand.
My second thought is that a good understanding of "software as contract" when building an API is essential, although most early programmers are building programs, not APIs.
One of my former co-workers at SUN defined for me the difference between one system being more powerful than another. If you could implement A by using B, but not implement B using A, B is more powerful. Design for eventual power.
With my professional software testing hat on, "design for error". Generally speaking, more of people's code is built to detect or respond to error than to do the actual task - and if not, the result is often too brittle to use.
(no subject)
Date: 2016-05-11 05:20 pm (UTC)
Not sure that I *quite* agree with this one -- a lot of the error-checking is pretty well encapsulated these days, so I typically find it a good deal less than half in terms of SLOC. But this might depend on your definition of "the actual task". (I often find that plumbing is the single largest component, try though I may to minimize that.)
But a spinoff that isn't at all obvious to the beginner is that, if you are building something at all serious, testing is more than half the effort. (And should generally be more than half the total code if your automated testbase is sufficient.)
(no subject)
Date: 2016-05-11 01:09 pm (UTC)
(no subject)
Date: 2016-05-11 01:39 pm (UTC)
Corollary: anything "clever" or "elegant" needs more commentary, not less.
(no subject)
Date: 2016-05-11 05:21 pm (UTC)
(no subject)
Date: 2016-05-11 02:54 pm (UTC)
Concurrency: different paradigms, and when one is better than another.
More philosophically:
That programming is, at essence, struggling with the finite ability of the human brain to understand things. Most other principles fall out from this.
The need to balance between subsets of readability, maintainability, compactness, extensibility, execution speed, scalability, safety, data safety, latency, testability, etc. Often multiple axes can be improved simultaneously, but sometimes not, and sometimes it's very hard to figure out how.
The bane of premature or naive optimization of these. Often, coders are taught to always optimize for one (e.g. algorithmic complexity) even when that doesn't matter (e.g. because the domain is too small for O(n lg n) to be worse than O(n lg lg n), and what matters much more is either the big data structure storing that thing, or the density of the code to maintain it).
The inability of any realistic algorithm to cover all of a combinatorially complicated world. CS-as-taught shows you only toy problems, and pretends that you can prove correctness. (And you can -- in limited cases.)
The difference between essential and inessential complexity, especially in the face of the above.
The different demands of code depending on its purpose: data manipulation, user interaction, data management, etc. Too often coders believe the code they write is the only type (net admins think all code is about routing, front-end folks think all code is about dispatching events, etc.) and that therefore their paradigm is the only valid one.
That programs are necessarily created in service of humans. In the end, some human will need to see benefit, or you don't get paid.
(no subject)
Date: 2016-05-11 05:15 pm (UTC)
(I'll have to think about what to say about concurrency beyond "Explicit Threads are your Enemy; avoid them". Beyond that may be getting deeper into the weeds than I want here, especially because concurrency is mostly irrelevant to several major languages. If you're thinking about concurrency, you're probably already at a higher level than I'm focusing on here.)
That programming is, at essence, struggling with the finite ability of the human brain to understand things. Most other principles fall out from this.
Ah, lovely point. Indeed, I think there's a corollary that I totally should point out here, which is that a *large* fraction of serious programming is all about breaking problems down into small, bite-sized pieces that are simple enough that you can be somewhat confident about how they behave, and then using those as building blocks to build larger pieces. Perfection is unlikely, but the goal is being able to understand each component.
The bane of premature or naive optimization of these.
Mmm -- very good point. That may actually be the only point worth making at this level, rather than even worrying about the complexity one.
That programs are necessarily created in service of humans. In the end, some human will need to see benefit, or you don't get paid.
Yaas. The audience I'm thinking about right now (which is mainly entrepreneurs) mostly gets that in their gut, but it's a common enough trap to be worth underlining.
(no subject)
Date: 2016-05-11 02:55 pm (UTC)
Comment/document. It is possible to go overboard here, but at least have a written sense of what each function is doing for you.
Recursion. Or more generally, strategies for smartly breaking a problem down into smaller bits. Honestly, this is more to do with understanding your logic more than anything else.
Testing. Ways of looking into what is happening. Dump lots of stuff to stdout now, clean it up when everything is working (and perhaps commented).
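On the recursion point, a small Python sketch (using a nested dict as an invented stand-in for a directory tree) shows the "break the problem into smaller bits" shape:

```python
# A directory's total size is its files plus the totals of its subdirectories.
# Base case: an int is a file's size. Recursive case: a dict is a directory.
def tree_size(node):
    if isinstance(node, int):
        return node
    return sum(tree_size(child) for child in node.values())
```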
(no subject)
Date: 2016-05-11 04:06 pm (UTC)
(no subject)
Date: 2016-05-11 04:04 pm (UTC)
Start somewhere: People will talk about top down or bottom up; they'll talk about test-driven development, about comments-first development; it doesn't matter as long as you end up with everything you need eventually. The most important thing in order to end up with working code is to have code -- and the most important thing there is to start. Write whatever is easiest; when you hit a stopping/changing point, go on to the next thing, whether that's tests or comments or the next piece of code. The hardest place to code from is often a blank file (this ties nicely into refactoring, since the code even after it functions might not be super-efficient, and that's ok).
Computer time is cheaper than programmer time: Efficiency can and often will matter, but for far too many things, it's cheaper -- in terms of effort, and of money where money is involved -- to write things in ways that are easier and faster for the programmer than to do things that are easier/more efficient on the iron. Sure, you might need that extra bit of speed, but chances are, you won't.
You Won't Need It: That brings us to the great dictum of XP: You Aren't Gonna Need It (YAGNI). Try to favor working code over coding abstractions that you don't need yet, pretty much always. Sure, if an abstraction is easy and the natural place to go next, you can code it in advance of direct necessity. But it's much easier to take working code and refactor it to the more perfect abstraction than to build the abstraction and have it complicate and obfuscate your code long after it becomes obvious that you'll never use it.
ETA: Fail fast. In general, you don't want to code to check assumptions and cover every possible case at every point; that way lies madness. Instead, rule out the impossible and incorrect cases as early as possible in a given branch, then write the rest of the code assuming the inputs are appropriately correct. That way you're dealing with the bad cases in as few places as possible. (Exceptions and exception handling are a special case of fail fast, as using them avoids having to have lots of code everywhere checking for and passing up errors).
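A sketch of the guard-clause shape that describes, in Python (the function and the validation rules are invented for illustration):

```python
# Fail fast: rule out the impossible cases at the top, then write the rest
# of the function assuming the inputs are valid.
def average_age(users):
    if not users:
        raise ValueError("average_age: empty user list")
    for u in users:
        if "age" not in u:
            raise KeyError("user record missing 'age' field")
    # Happy path: by here, every record is known to be well-formed.
    return sum(u["age"] for u in users) / len(users)
```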
(no subject)
Date: 2016-05-11 04:47 pm (UTC)
(no subject)
Date: 2016-05-11 05:29 pm (UTC)
Heh. Yep, this is one of the ones I've learned painfully over the years. I do allow myself a fair amount of time thinking about the problem, but that hits diminishing returns fairly quickly; at that point, you learn more by starting to actually code.
Try to favor working code over coding abstractions that you don't need yet, pretty much always.
Mixed feelings here, since I violate this one all *over* Querki's codebase. But I have the odd position of being both the Architect and Product Manager, so I know which abstractions I'm *likely* to care about down the line, and are worth spending the time on now. That's an unusual situation.
(no subject)
Date: 2016-05-12 03:52 pm (UTC)
(no subject)
Date: 2016-05-12 06:16 pm (UTC)
The classic example for me is Querki's identity-management system, which is *wildly* more sophisticated than I need yet; indeed, more sophisticated than nearly any other system I know. But I know where I'm planning on going later this year, and what my long-term objectives are in terms of privacy, and fixing the core abstractions later would have been *extremely* difficult and painful, so it was worth spending the infrastructure time upfront on getting the bones right...
(no subject)
Date: 2016-05-11 04:52 pm (UTC)
(no subject)
Date: 2016-05-11 05:30 pm (UTC)
(no subject)
Date: 2016-05-12 10:35 am (UTC)
(no subject)
Date: 2016-05-12 10:50 am (UTC)
Tests need to be FAIR: fast, automatic, independent, and reliable. Fast and automatic, because if tests are too time-consuming or too much of a pain to run, you won’t actually run them, and they’ll do you no good. Independent, because if side effects of one test can change the outcome of other tests, it enlarges the debugging space exponentially (and yes, I mean that literally). Reliable, because if a test sometimes fails for reasons having nothing to do with the program being tested, people lose faith in it and it does you no good.
It's a whole lot easier (both simplicity of code and independence of tests) to write tests for functions whose inputs and outputs are explicit -- ideally, parameters and return values -- than for those that depend on global state, hidden state, or I/O, and/or produce results in global state, hidden state, or I/O. Which leads to the corollary "segregate interesting processing from I/O."
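A Python sketch of that corollary (the file format and names are invented): the interesting processing takes plain values, so it's trivially testable, while the I/O wrapper stays thin:

```python
# Pure function: explicit input (an iterable of lines), explicit output (a
# dict). No files, no globals -- easy to test with a plain list of strings.
def parse_totals(lines):
    totals = {}
    for line in lines:
        name, value = line.strip().split(",")
        totals[name] = totals.get(name, 0) + int(value)
    return totals

# Thin I/O wrapper: all it does is open the file and hand the lines over.
def totals_from_file(path):
    with open(path) as f:
        return parse_totals(f)
```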
(no subject)
Date: 2016-05-12 11:58 am (UTC)
When possible, absolutely so -- indeed, the same reasoning leads you down the road towards functional programming, for generally good reasons.
But I'm particularly conscious of this point this week, having *finally* figured out how to enhance my functional-test harness to deal with Email Interception. One of the problems keeping my functional tests (which try to play scrupulously "fair", mocking as little as possible) too primitive was that the key invitation workflow involves email in the middle of it. Figuring out how to stub that so that the tests could "receive" the email (and parse out the critical link from it) was a real headache. Testing is hard work; functional testing is a *lot* of hard work.
(And indeed, the solution turned out to be exactly a version of "segregate interesting processing from I/O" -- refactoring the email-*sending* code from the email-*generating* code so that the sender could be stubbed in the functional-test environment...)
mocks, etc.
Date: 2016-05-13 03:22 am (UTC)
Which means, in turn, that your software has to be parameterized by the LooksLikeANetworkRouter interface: in normal operation you give it a real network router, while for certain kinds of testing you give it a MockNetworkRouter object that can be told how and when to fail.
For another example, my project has lots of code that uses real-world timestamps. It's notoriously difficult to write unit tests involving real-world timestamps, because the real world insists on moving forward in time from one test run to the next. If you rely on a particular section of code taking at least or at most a specified period of time, you've lost the "R for reliable"; if you rely on Sleep(num_microseconds) calls, you've lost the "F for fast". The solution is to parameterize the program with an AbstractClock, one implementation of which is the real system clock, and another implementation is a MockClock that can be read, set, advanced by various amounts, etc.
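A minimal Python sketch of that AbstractClock idea (the class and function names here are invented for illustration):

```python
import time

class SystemClock:
    """Production clock: just reads the real system time."""
    def now(self):
        return time.time()

class MockClock:
    """Test clock: time only moves when the test says so."""
    def __init__(self, start=0.0):
        self._now = start
    def now(self):
        return self._now
    def advance(self, seconds):
        self._now += seconds

# Code under test asks the clock for the time, rather than calling
# time.time() directly, so tests can control time completely.
def is_expired(created_at, clock, ttl_seconds=3600):
    return clock.now() - created_at > ttl_seconds
```

A test can then exercise expiry instantly and deterministically: create a MockClock, record created_at, advance past the TTL, and check is_expired -- no Sleep calls, so the test stays both fast and reliable.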
Which brings us to another rule of thumb that I thought somebody had already mentioned in this thread but I don't see now: any object you get from outside your code (whether passed in as a parameter, returned by a factory method, etc.) should be used according to its interface, not its implementation. For an extreme example, the "factorial" function really should have a parameter of type LooksLikeANaturalNumber, as long as that interface has IsZero, Predecessor, and Multiply functions.
And which reminds me of another rule of testing: for every test, ask yourself where a bug would have to be in order for this test to expose it. If you already have tests that would expose a bug in this specific place, you probably don't need another; conversely, if you have large sections of code in which a bug could hide without any of your tests exposing it, you need more tests. Mocks are good for detecting bugs in the parts of your code that directly interface with the external system that you're mocking.
Re: mocks, etc.
Date: 2016-05-13 12:50 pm (UTC)
I have a standard architectural pattern that I pretty much always use for major programs -- I originally learned it from Tom Leonard at Looking Glass, and then coined the name The Ecology Pattern after I started using it elsewhere. I've written that particular framework in four different languages, as the basis of half a dozen companies, over the past 15 years.
It has several advantages (it was originally developed by Tom mainly to keep C++ compile times decent, and I like it because it's much more rigorous about initialization than many approaches), but one of them is making mocking extremely easy -- it's a variation of the Dependency Injection approach, and insists on a strict separation of interface from implementation. So testing *always* involves building your Ecology out of the real implementations of the components under test, stubbing the interfaces you can ignore, and mocking the ones you want to instrument. Very useful general approach to the world.
And which reminds me of another rule of testing: for every test, ask yourself where a bug would have to be in order for this test to expose it.
Which is related to a general point worth making: debugging is the one part of programming that is actually *science*, and it's worth being rigorously scientific about it. Observe the bug in action; formulate hypotheses; build tests that could prove (and more importantly, disprove) those hypotheses; and see what happens.
I think any experienced programmer knows this deep in their gut, but it's not at all obvious until you've been doing it for a while...
Resiliency trumps simplistic notions of function
Date: 2016-05-13 11:34 am (UTC)
Example: Not too long ago my team was devising a new server-side message logging function. I gave the developers direction to save off messages to intermediate SAN storage -- a pretty reliable thing -- and then have a separate background task commit the messages to the destination DB. The reply: "Oh no, we can't have people wait to have their messages appear in the log." That thinking was foolish: if the DB was down for maintenance, the entire system would be unusable!
That thinking comes from the old, "enterprise software" way of thinking, where typically everything goes down for maintenance all at one time. A modern cloud system however, needs to maintain as much uptime for as many features as possible, and further needs to assume any dependent system component might be unavailable at any time. By using an intermediate storage approach, the messaging app could still function, even if the persistent store was offline. The actual delay anyone would realize in seeing logged messages would be minuscule anyway, like 5 seconds at worst.
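The shape of that design, sketched in Python -- the class names are invented, and an in-memory queue stands in for the intermediate SAN storage:

```python
from collections import deque

class BufferedLogger:
    """Accept log messages immediately; commit them to the store later."""
    def __init__(self, store):
        self._buffer = deque()  # stand-in for the intermediate SAN storage
        self._store = store     # the destination DB; may be down at any time

    def log(self, message):
        # Never blocks on the DB: callers see their message accepted at once.
        self._buffer.append(message)

    def flush(self):
        # Background task: drain what we can; if the store is down, keep the
        # messages and try again on the next pass.
        while self._buffer:
            try:
                self._store.write(self._buffer[0])
            except ConnectionError:
                return
            self._buffer.popleft()
```

The key property: log() succeeds whether or not the store is reachable, and nothing is lost -- messages simply wait in the buffer until a flush pass finds the store back up.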
(no subject)
Date: 2016-05-14 08:09 pm (UTC)
(no subject)
Date: 2016-05-14 09:43 pm (UTC)
(no subject)
Date: 2016-05-15 07:09 am (UTC)
(no subject)
Date: 2016-05-15 08:22 am (UTC)
So instead of:
if ($foo == 'bar' || (count($biz) > 2 && $biz[2] == 'bax'))
break down to:
$second_biz_is_bax = count($biz) > 2 && $biz[2] == 'bax';
if ($foo == 'bar' || $second_biz_is_bax) {}
The main reason is it's easier to read. I like to state this as 'I'm not a computer -- the computer is a computer' -- having to parse long expressions like that to understand their purpose is a waste of human brain power.
Also, it's much less effort to debug when you're trying to work out why the condition is failing, as you can dump the output of $second_biz_is_bax without copy-pasting actual code. (It may be that people with fancy debuggers can get that anyway, but I don't have a fancy debugger...)