[This weird ramble is kind of about programming, but this time it's introductory-level and generally useful, instead of the wizard-level stuff I usually talk about. We are going to teach a few basic programming concepts using my comic book collection as a motivating real-world example.]
There -- that project's been underway for about three years, and at least it's in decent shape before things have to go into storage.
I occasionally refer to Steve, the proprietor of The Outer Limits in Waltham, as my Comic Book Pusher. Some people think I'm kidding, but it's more that I decided many years ago that comics are a less dangerous (if not necessarily cheaper) addiction than cocaine.
But the thing is, I am very much *not* a comic book "collector". I don't buy for value, or even completeness -- I just like reading them. So I tear through huge numbers of comics, and then set them aside. And once in a while, it occurs to me to sort them as I put them into boxes and stick them in the basement, so that later I might be able to re-read the best ones.
The problem is, while sorting a *few* comics (say, a few hundred) is easy, sorting many thousands of them is not. Hence, the last time I did a full sort of the entire pile was rather a while ago. "A while ago" being defined, in this case, as 1990. This is a Problem.
This is where the computer science degree comes in. The most important thing you learn in programming classes isn't how to program -- that becomes obsolete rather quickly as systems and languages change -- but rather, how to analyze algorithms. And the very first thing you learn is how fast the various sort algorithms run.
Most people, when given a bunch of things to sort, use some kind of Insertion Sort -- that is, you just put things in order, sticking them into place as you get to them. This works great for sorting anywhere up to about 50 comics, but once you get past about a handspan in width it slows down dramatically, because it starts taking a long time to find the right place to put each one. And in fact, we are taught in school that an insertion sort is "n squared" -- the amount of time it takes is proportional to the number of things to sort, squared. When you're sorting tens of thousands of comic books, that is a very long time indeed. (Sorting ten thousand comics this way takes literally a million times as long as sorting ten.)
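To make the "n squared" claim concrete, here is a minimal Python sketch of insertion sort that also counts comparisons. The function name and the comparison counter are my own illustration, not anything from a library, and the worst case shown (a pile that starts in exactly reverse order) is an assumption chosen to make the growth obvious.

```python
def insertion_sort(items):
    """Sort the way most people sort a small pile by hand.

    Returns the sorted list plus the number of comparisons made,
    so the growth in work is visible.
    """
    result = []
    comparisons = 0
    for item in items:
        # Walk back from the end of the sorted pile to find the right slot.
        pos = len(result)
        while pos > 0:
            comparisons += 1
            if result[pos - 1] <= item:
                break
            pos -= 1
        result.insert(pos, item)
    return result, comparisons


# Worst case: the pile starts out in exactly reverse order.
for n in (10, 100, 1000):
    _, count = insertion_sort(list(range(n, 0, -1)))
    print(n, count)
```

For a reversed pile the count is n(n-1)/2, so ten items take 45 comparisons and a hundred take 4950: ten times the comics, roughly a hundred times the work. The same arithmetic gives the figure above, since (10,000 / 10) squared is 1,000,000.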
The canonical fastest sorting algorithm is known, quite reasonably, as the Quick Sort. Conceptually, it's pretty straightforward. You take your pile of things, and create two buckets around a midpoint: in our case, the buckets would be "comics from A-M" and "comics from N-Z". Then repeat this for each bucket, so you'd wind up with four buckets in order: A-F, G-M, N-S, T-Z. Keep repeating until each bucket has only one comic book and *poof* -- it's all sorted! This works *great* in the computer -- indeed, in some programming languages it's basically a one-liner -- and it is, on average, "n log n": the amount of time it takes to sort everything is proportional to the number of items times the logarithm of that number (which is relatively small). This is more or less as fast as you can theoretically go for any sort that works by comparing items to each other. It has only one problem: it requires n buckets, and I do not have tens of thousands of tiny comic book boxes.
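As a rough sketch of the bucket-splitting idea, here is a short Python version. It is not the in-place algorithm you would find in a real library: it builds new lists at every step (the software equivalent of all those tiny boxes), and picking the middle item as the midpoint is an assumption made for simplicity. A badly chosen midpoint can degrade the running time toward n squared.

```python
def quick_sort(pile):
    """Split the pile around a midpoint, then sort each side the same way."""
    if len(pile) <= 1:
        return list(pile)
    # Assumption: use the middle item as the midpoint ("pivot").
    pivot = pile[len(pile) // 2]
    before = [c for c in pile if c < pivot]
    same = [c for c in pile if c == pivot]
    after = [c for c in pile if c > pivot]
    return quick_sort(before) + same + quick_sort(after)


print(quick_sort(["Sandman", "Batman", "Zot!", "Gravel", "Action"]))
```

Each level of recursion halves the buckets (on average), which is where the "log n" comes from; every level still touches all n items, hence "n log n".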
So for real-world problems, we have the Merge Sort, which isn't *quite* as fast as Quick Sort, but is still basically "n log n". For a merge sort, you start out by doing an insertion sort (just putting things in order) for as many things as you can easily do (in our case, about a handspan's worth of comics). Set those aside, and do it again for the next batch; repeat until everything is in little piles. Now merge together a reasonable number of piles -- take around 3-6 piles of comics, and just go from front to back, putting them together, which is extremely quick and gets you a *bigger* pile. Do that for all the small piles, so you now have some big piles. Now do the same thing for the big piles, so you wind up with one really big, completely sorted pile.
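The procedure above translates fairly directly into code. This is a minimal Python sketch under a few assumptions: `handspan` and `piles_per_merge` are names and default values I made up to mirror the description, and the built-in `sorted()` stands in for the by-hand insertion sort of each small pile. The standard library's `heapq.merge` does the front-to-back merging of several already-sorted piles.

```python
import heapq


def hand_merge_sort(comics, handspan=50, piles_per_merge=4):
    """Sort small piles first, then repeatedly merge a few piles at a time."""
    if handspan < 1 or piles_per_merge < 2:
        raise ValueError("need handspan >= 1 and piles_per_merge >= 2")

    # Step 1: cut the heap into handspan-sized piles and sort each one.
    # (sorted() is a stand-in for insertion-sorting a small pile by hand.)
    piles = [sorted(comics[i:i + handspan])
             for i in range(0, len(comics), handspan)]

    # Step 2: merge a few piles at a time into bigger piles, and repeat
    # until only one pile is left.
    while len(piles) > 1:
        piles = [list(heapq.merge(*piles[i:i + piles_per_merge]))
                 for i in range(0, len(piles), piles_per_merge)]

    return piles[0] if piles else []


print(hand_merge_sort([5, 3, 9, 1, 7, 2], handspan=2, piles_per_merge=3))
```

Merging more piles at once means fewer passes over the whole collection, which matters a lot when each pass involves physically shifting longboxes.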
(Of course, none of this is a deep dark secret -- good librarians know this technique just as well as good programmers. But it occurred to me that many folks probably don't have cause to sort thousands of items very often, and might not know the trick.)
So when I've read 3-6 months' worth of comics and put them away, I typically do about two passes of this, so that I wind up with 1-3 longboxes, sorted and merged in order. What I *haven't* done since 1990 is continue the process: take these now rather large piles, and keep merging them.
But everything is going into storage shortly, which means that all hope of *ever* seeing the collection fully sorted is Doomed Doomed Doomed if I don't make progress. So a few months ago I kicked back into motion the project that I had started before Jane died, to fully merge the whole thing. Sadly, I didn't get it all the way complete, and I need to stop now and focus on packing. But I've gotten to the point where I now have three *humongous* piles of longboxes, labeled runs A, B and C, which represent all the comics since 1990. After the move is done, I can begin to pull those out of storage, along with the pre-1990 run, merge the whole thing, and craft a really serious Querki app to inventory and sell most of it. (My comics are going to be one of the acid tests for Querki, and will help drive several generally interesting features.)
And the final result? I have 68 longboxes in the post-1990 run, along with 30-40 in the pre-1990 one, already in storage. In total, I'd guess that that's about 30,000 comics, enough that the collection *must* be kept directly on the slab, lest it break the house. A fair addiction, indeed...
(no subject)
Date: 2013-03-20 08:03 pm (UTC)(no subject)
Date: 2013-03-20 08:06 pm (UTC)(no subject)
Date: 2013-03-20 08:38 pm (UTC)This covers some fairly straightforward problems -- eg, Welcome to Tranquility is filed under "T", which just needs an arbitrary but consistent answer -- but some that require at least a solid understanding of what the book *is*. For example, that Adventure Comics is mostly sorted under "S" for "Superman" -- and indeed, that there is a ten-year run when it gets interfiled with Superman, Action and Man of Steel, since they essentially comprised a single weekly comic -- but old issues might be under "L" for "Legion of Super-Heroes".
And of course, then there's the totally impossible ones unless you know the comics, like the fact that Strange Kiss, Strange Kisses and Strange Killings all get filed under "G" -- because that series of mini-series eventually resolved into the ongoing story Gravel. Or the way that Martha Washington Goes to War also gets filed under "G" -- because it is a sequel to the classic story Give Me Liberty.
Truth to tell, even I have trouble keeping track of it sometimes. Hence, one of the decisions I made *many* years ago was to begin crafting the comic book database, initially as an "authority list" (I think that's the correct librarian jargon) to record my alphabetization decisions. So the Series table has both Title (the readable title) and Sorted As (which may have precious little to do with the cover). And individual Issues have a Sorted Under field, to record cases where, eg, I only bought one issue of this comic because it was crossing over with something I actually gave a damn about. Most importantly, though, it means that when I find myself trying to remember what I decided, I can pull up the database and at least be consistent...
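A minimal sketch of how such an authority list can drive the sort, in Python. The dictionary below is a stand-in for the Series table's Title and Sorted As fields, and the third element of each issue tuple stands in for the per-issue Sorted Under field; the exact stored values, the issue numbers and the `filing_key` name are assumptions for illustration, not the real database.

```python
# Title -> "Sorted As". Values are illustrative guesses at what is stored.
SORTED_AS = {
    "Strange Kiss": "Gravel",
    "Strange Kisses": "Gravel",
    "Strange Killings": "Gravel",
    "Martha Washington Goes to War": "Give Me Liberty",
    "Welcome to Tranquility": "Tranquility",
    "Adventure Comics": "Superman",
}


def filing_key(issue):
    """Build a sort key for a (title, number, sorted_under) tuple.

    A per-issue sorted_under wins; otherwise fall back to the series-level
    "Sorted As"; otherwise file under the cover title.
    """
    title, number, sorted_under = issue
    filed_as = sorted_under or SORTED_AS.get(title, title)
    return (filed_as.casefold(), title.casefold(), number)


issues = [
    ("Strange Kiss", 1, None),
    ("Batman", 400, None),
    ("Adventure Comics", 300, "Legion of Super-Heroes"),  # hypothetical issue
    ("Martha Washington Goes to War", 1, None),
    ("Welcome to Tranquility", 1, None),
]
for title, number, _ in sorted(issues, key=filing_key):
    print(title, number)
```

The point of the design is that the alphabetization decision is recorded once, as data, so the sort itself never has to "know the comics".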
(no subject)
Date: 2013-03-20 09:54 pm (UTC)These days I might not bother sorting the physical objects directly...
(no subject)
Date: 2013-03-20 11:00 pm (UTC)Digital materials? We have computers to do the searching for us, and so content or metadata searches make sorting a far more ignorable task.
But when your optimal storage is in large densities (comic book boxes hold many, many comics) you still can't just use digital records to find things, because you can only get down to a certain level of granularity. 'Box 32, last third' might be about it.
(no subject)
Date: 2013-03-21 12:43 am (UTC)(no subject)
Date: 2013-03-20 11:27 pm (UTC)Good point about the barcodes, though. The older ones don't have them, of course, so I never got into the habit of using them. Might be an interesting long-term project to figure out an appropriate use for them.
(If the hardware's available. Has somebody built an affordable and easy-to-use version of the CueCat yet?)
(no subject)
Date: 2013-03-21 12:09 am (UTC)(no subject)
Date: 2013-03-21 12:15 am (UTC)If a large portion of the collection doesn't have bar codes, then I doubt the automation suggestions will help much. Maybe if Google Goggles is a lot smarter than I think it is, but that's a long shot.
(no subject)
Date: 2013-03-21 12:30 pm (UTC)So I'd guess, offhand, that 60% has barcodes. A lot, but by no means the overwhelming majority...
(no subject)
Date: 2013-03-21 12:20 am (UTC)[I am highly impressed by the "Book Catalogue" app that I use on my Android devices. It scans barcodes to add books, retrieves information from various sources including Amazon and LibraryThing, allows me to physically notate location (by hand, so I don't do that), and places entries on virtual bookshelves.]
(no subject)
Date: 2013-03-21 06:28 am (UTC)(no subject)
Date: 2013-03-21 12:32 pm (UTC)And useful to know that there is a good app that talks to LibraryThing -- that might be quite helpful for me, at least for the newer books...
(no subject)
Date: 2013-03-20 08:23 pm (UTC)Dude, your protestations to the contrary, like it or not, you ARE a collector.
...or at least a comic book hoarder. :)
(no subject)
Date: 2013-03-20 08:47 pm (UTC)I suspect my acquisitions will begin to taper off in the next few years, as tablets and apps get good enough that there is no longer any strong reason to buy most comics in paper form. And if I'm *really* lucky, Marvel will decide to do something as irritating as DC did, convincing me to drop the entire line -- that would help a good deal...
(no subject)
Date: 2013-03-20 09:34 pm (UTC)You hoard comics.
I hoard SF books.
Let's face facts. Me, I avoid facts - that's why I still have the damned books. :-)
(no subject)
Date: 2013-03-20 10:45 pm (UTC)(no subject)
Date: 2013-03-20 11:02 pm (UTC)(no subject)
Date: 2013-03-22 06:24 pm (UTC)(says the very occasional comic-book reader. As a matter of fact, I think that any comics I HAVE read, I borrowed from you.)
(no subject)
Date: 2013-03-23 02:50 pm (UTC)The first time this happened (20-some years ago), it seemed an innovative solution to having so much continuity to deal with. But it didn't work well then -- the seams always showed, ever after -- and by now it just looks cynical.
So since I was already pretty disenchanted with DC -- they still have some good writers, but their editorial staff have been absolute *crap* for a long time -- I took this as the opportunity to drop the entire line flat. I'm still buying a lot from Vertigo, their higher-quality line, but I'm avoiding all of core DC. (Including the new Constantine book: Vertigo cancelled Hellblazer, a longtime favorite horror comic of mine, and reinvented it as essentially a superhero book. Ick...)
As a non-trained programmer...
Date: 2013-03-20 08:29 pm (UTC)Not only did you teach me something, I now more fully appreciate Monday's xkcd cartoon.
Re: As a non-trained programmer...
Date: 2013-03-20 08:42 pm (UTC)Re: As a non-trained programmer...
Date: 2013-03-20 09:49 pm (UTC)Re: As a non-trained programmer...
Date: 2013-03-20 11:30 pm (UTC)(Of course, it was also the most gruelling class I took in college, by design. That's the class where I got a 45 on the final exam -- and got scolded by my classmates for blowing the curve and stealing the A. *Tough* exam...)
Re: As a non-trained programmer...
Date: 2013-03-20 11:53 pm (UTC)Re: As a non-trained programmer...
Date: 2013-03-21 12:25 pm (UTC)(no subject)
Date: 2013-03-21 01:09 pm (UTC)Some course professor must have been impressed by the weight of the material instead of its utility.
(no subject)
Date: 2013-03-21 02:40 pm (UTC)The difficulty was mainly that the professor crammed a *ton* of knowledge into a one-semester class -- it started with the basics of big-O notation (and the other, less common notations), and then went through how to build, use and analyze essentially every major data structure at blazing speed.
Enormously valuable stuff (arguably more *useful* than the entire rest of my college classes combined), and I suspect that nowadays I'd find most of it pretty obvious. But when you have a class full of student hackers, none of them having much previous formal training in any of this, it was a serious slap of cold reality...
(no subject)
Date: 2013-03-20 09:45 pm (UTC)Back when I was in college, I got a summer job in the office of a car dealership in the South Bronx. The first job I had was to review and sort lists of VIN's in stock, in order to compare them against the list provided by the manufacturer. This task was done quarterly, and was much disliked by the regular staff.
They gave it to me.
Fresh from a computer algorithms class, I built a plan, and assaulted the problem. Lunch was at noon: I stayed until 12:15 to finish the job. I left for lunch.
I returned at 1. To discover that I had been "fired" from the office job. It seems that it had taken the regular staff at least a week to do it. At first they had doubted my results. When my results were correct, they insisted I be removed from the office for "making them look bad".
My new job in the dealership? They had been collecting physical wheel well moldings for about 50 years, and they were somewhat randomly emplaced in the attic of the dealership's parts department. I was given a sawed off baseball bat for the rats (and there were rats) and sent up there in the summer heat to sort the moldings.
Slow learner me: I sorted the moldings, in about a week. (Merge sort works for ungainly physical objects too.)
Parts people, who had always refused the job because it was too hard and would take too long, had me fired: I made them look bad.
I was demoted, again, to the garage. Where it became my job to sweep the floors and wash the cars. There was no way to make people look bad doing that - although I did optimize the pattern for sweeping - but since it took all day anyway, I just did a better job than the other guys had.
Still: they weren't happy with that. I made them look bad...
Repeat until they found a job so impossible, that I could not do it. Then they insisted I do it: we had just rented a 2 acre field filled with weeds. In order to park cars there, I had to clear that field. With a dull machete. I sharpened the machete, and bought some gloves. But within a few hours my hands were too blistered to work.
The owner insisted I continue. I went out there, worked until my hands were bloody, and came into his office. I removed my gloves, left two bloody handprints on the old-fashioned desk blotter, and said "I quit".
Some things you can't optimize away.
(There were some consequences for the owner, however...)
(no subject)
Date: 2013-03-20 11:05 pm (UTC)(no subject)
Date: 2013-03-20 10:18 pm (UTC)(no subject)
Date: 2013-03-20 11:34 pm (UTC)Fortunately, most modern libraries have decently good implementations of these algorithms, at least for medium-sized use. But this sort of thing is one of the numerous reasons why Querki is specifically focused on and limited to *small* datasets initially. I trust myself to be able to write a good system for managing 50k objects, but want serious DB programmers on-board before we try to tackle enterprise-scale data...
(no subject)
Date: 2013-03-20 11:24 pm (UTC)Computational Fairy Tales - http://is.gd/yWf7Av
Best Practices of Spell Design - http://is.gd/ALQEM4
(no subject)
Date: 2013-03-20 11:35 pm (UTC)(no subject)
Date: 2013-03-21 06:35 am (UTC)(no subject)
Date: 2013-03-21 12:36 pm (UTC)Now that I'm in Somerville, Outer Limits is getting a little inconvenient for me, truth to tell; I may eventually have to switch stores. But I'm in no rush -- it's a fine thing to have a longtime relationship with a good dealer, and I've been shopping at Steve's for something like 25 years now...
(no subject)
Date: 2013-03-21 01:19 pm (UTC)*Sniff*
Date: 2013-03-23 02:55 pm (UTC)I pointed out that she could do a bucket sort: Deal out all the cards into piles "starts with A", "starts with B", etc; get each pile sorted; and then at the end just stack all the piles on top of each other. She was gob-smacked, and told me "You've changed the way I will sort things for the rest of my life."
I hadn't thought of that day in years... :-)
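For the curious, the card-dealing trick can be sketched in a few lines of Python. The function name is made up, and using the built-in `sorted()` for each small pile is an assumption standing in for whatever by-hand method you like.

```python
from collections import defaultdict


def bucket_sort(cards):
    """Deal cards into piles by first letter, sort each pile, then stack them."""
    buckets = defaultdict(list)
    for card in cards:
        # "Starts with A", "starts with B", and so on (case-insensitive).
        buckets[card[:1].upper()].append(card)

    result = []
    for letter in sorted(buckets):
        # Each pile is small, so sorting it is quick.
        result.extend(sorted(buckets[letter], key=str.casefold))
    return result


print(bucket_sort(["Mallory", "alice", "Bob", "Aaron", "bert"]))
```

Dealing into piles takes one pass and no comparisons between cards at all, which is why it feels so much faster than hunting for each card's slot in one long row.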
Re: *Sniff*
Date: 2013-03-23 02:59 pm (UTC)(Who was in many ways an untrained programmer: she never studied the stuff, but *thought* in the right ways, and had a knack for picking up techniques that were useful. My rule of thumb was that I would usually intuit a solution quicker than her, but once I pointed it out she would always apply it much more rigorously and correctly than me ever-after. This is why I would usually win a game the first few times, but she'd usually wind up winning far more often in the long run...)
(no subject)
Date: 2013-03-24 03:31 am (UTC)If the ultimate goal is to sell them,
a) don't do any more sorting unless you enjoy it more than anything else you could do with that time.
b) when you're ready to sell, put the comics up one box at a time, *indexing* but not necessarily sorting each box as you do so (though you can sort within a box while listing them if you want to). Include markers to make it easier to find a section within a box.
c) repeat until done.
All remaining sorting takes place on the computer; fetching requires referring to multiple indices -- but you don't have to fetch something until it's actually going to sell, so you have a reward for the work of fetching it.
(no subject)
Date: 2013-03-24 02:30 pm (UTC)And the indexing approach you're describing actually isn't as much of a win as you think, because I'm going to be adding a feature to Querki to make indexing much faster for coherent runs than individual issues -- basically, make it easy to say "add Batman 287-322" as a single entry. Also, the effort of *retrieving* a run in order to sell it (because let's get real, it is much harder to sell back issues if they aren't sold as coherent blocks) would be much larger, simply because of the effort of physically shifting all those boxes.
I haven't done a formal analysis of it, but I *think* that, from an overall efficiency POV, the pre-sort is still a win. Not nearly as huge a win going from four runs to one as it was when I had dozens of runs, but still a win.
Most importantly, though, I might as well. I still have to inventory everything, and separate it into the three big buckets ("definitely sell", "definitely keep" and "reconsider later"), and it's actually not much more work to do the final merge while that is happening: take all four remaining runs, and simultaneously merge them, inventory them, and re-separate them. That will have the optimal final result, and isn't much harder than handling the runs separately...
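Roughly, that combined pass looks like the following Python sketch. Everything named here is hypothetical: `decide` is a placeholder for the human judgment call on each comic, and the bucket names are shorthand for the three categories above. The merge itself uses the standard library's `heapq.merge`, which walks several already-sorted runs at once.

```python
import heapq


def merge_and_separate(runs, decide):
    """Merge sorted runs in one pass, inventorying and bucketing as we go.

    decide(comic) must return "sell", "keep" or "reconsider".
    """
    inventory = []
    buckets = {"sell": [], "keep": [], "reconsider": []}
    for comic in heapq.merge(*runs):
        inventory.append(comic)                 # record it once
        buckets[decide(comic)].append(comic)    # each bucket stays sorted
    return inventory, buckets


runs = [["Batman", "Sandman"], ["Action", "Zot!"], ["Gravel"], []]
inventory, buckets = merge_and_separate(
    runs, lambda comic: "keep" if comic == "Sandman" else "sell")
print(inventory)
print(buckets)
```

Because the merged stream comes out in order, each of the three buckets ends up sorted for free, so every comic gets handled exactly once.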
(no subject)
Date: 2013-03-24 03:44 pm (UTC)So, altering my suggestion to face reality, you could do your final sort on one box (dividing it into three buckets in the process), start listing the appropriate bucket (fully sorted) immediately, and then merge other boxes into the three buckets gradually at your leisure. Some runs would get broken up in the process, but you'll still have the incentivization of seeing sales start happening -- or, if you don't see any sales, you might decide that it's a bad plan and move in a different direction.
I agree with Golden Square, you're acting like a hoarder; having decided to get rid of the comics, you've chosen the slowest possible path to actually doing so (this is the pot calling the kettle black, of course) -- could I interest you in some old-but-not-antique tomes on medieval philosophy?