jducoeur: (Default)
[personal profile] jducoeur
[This weird ramble is kind of about programming, but this time it's introductory-level and generally useful, instead of the wizard-level stuff I usually talk about. We are going to teach a few basic programming concepts using my comic book collection as a motivating real-world example.]

There -- that project's been underway for about three years, and at least it's in decent shape before things have to go into storage.

I occasionally refer to Steve, the proprietor of The Outer Limits in Waltham, as my Comic Book Pusher. Some people think I'm kidding, but it's more a matter than I decided many years ago that comics are a less dangerous (if not necessarily cheaper) addiction than cocaine.

But the thing is, I am very much *not* a comic book "collector". I don't buy for value, or even completeness -- I just like reading them. So I tear through huge numbers of comics, and then set them aside. And once in a while, it occurs to me to sort them as I put them into boxes and stick them in the basement, so that later I might be able to re-read the best ones.

The problem is, while sorting a *few* comics (say, a few hundred) is easy, sorting many thousands of them is not. Hence, the last time I did a full sort of the entire pile was rather a while ago. "A while ago" being defined, in this case, as 1990. This is a Problem.


This is where the computer science degree comes in. The most important thing you learn in programming classes isn't how to program -- that becomes obsolete rather quickly as systems and languages change -- but rather, how to analyze algorithms. And the very first thing you learn is how fast the various sort algorithms run.

Most people, when given a bunch of things to sort, use some kind of Insertion Sort -- that is, you just put things in order, sticking them into place as you get to them. This works great for sorting anywhere up to about 50 comics, but once you get past about a handspan in width it slows down dramatically, because it starts taking a long time to find the right place to put each one. And in fact, we are taught in school that an insertion sort is "n squared" -- the amount of time it takes is proportional to the number of things to sort, squared. When you're sorting tens of thousands of comic books, that is a very long time indeed. (Sorting ten thousand comics this way takes literally a million times as long as sorting ten.)

The canonical fastest sorting algorithm is known, quite reasonably, as the Quick Sort. Conceptually, it's pretty straightforward. You take your pile of things, and create two buckets around a midpoint: in our case, the buckets would be "comics from A-M" and "comics from N-Z". Then repeat this for each bucket, so you'd wind up with four buckets in order: A-F, G-M, N-S, T-Z. Keep repeating until each bucket has only one comic book and *poof* -- it's all sorted! This works *great* in the computer -- indeed, in some programming languages it's basically a one-liner -- and it is "n log n": the amount of time it takes to sort everything is proportional to the number of items times the logarithm of that number (which is relatively small). This is more or less as fast as you can theoretically go. It has only one problem: it requires n buckets, and I do not have tens of thousands of tiny comic book boxes.

So for real-world problems, we have the Merge Sort, which isn't *quite* as fast as Quick Sort, but is still basically "n log n". For a merge sort, you start out by doing an insertion sort (just putting things in order) for as many things as you can easily do (in our case, about a handspan's worth of comics). Set those aside, and do it again for the next batch; repeat until everything is in little piles. Now merge together a reasonable number of piles -- take around 3-6 piles of comics, and just go from front to back, putting them together, which is extremely quick and gets you a *bigger* pile. Do that for all the small piles, so you now have some big piles. Now do the same thing for the big piles, so you wind up with one really big, completely sorted pile.

(Of course, none of this is a deep dark secret -- good librarians know this technique just as well as good programmers. But it occurred to me that many folks probably don't have cause to sort thousands of items very often, and might not know the trick.)


So when I've read 3-6 months' of comics and put them away, I typically do about two passes of this, so that I wind up with 1-3 longboxes, sorted and merged in order. What I *haven't* done since 1990 is continuing the process: take these now rather large piles, and keep merging them.

But everything is going into storage shortly, which means that all hope of *ever* seeing the collection fully sorted is Doomed Doomed Doomed if I don't make progress. So a few months ago I kicked back into motion the project that I had started before Jane died, to fully merge the whole thing. Sadly, I didn't get it all the way complete, and I need to stop now and focus on packing. But I've gotten to the point where I now have three *humongous* piles of longboxes, labeled runs A, B and C, which represent all the comics since 1990. After the move is done, I can begin to pull those out of storage, along with the pre-1990 run, merge the whole thing, and craft a really serious Querki app to inventory and sell most of it. (My comics are going to be one of the acid tests for Querki, and will help drive several generally interesting features.)

And the final result? I have 68 longboxes in the post-1990 run, along with 30-40 in the pre-1990 one, already in storage. In total, I'd guess that that's about 30,000 comics, enough that the collection *must* be kept directly on the slab, lest it break the house. A fair addiction, indeed...

(no subject)

Date: 2013-03-20 08:03 pm (UTC)
From: [identity profile] etherial.livejournal.com
Reminds me of the complex and arduous task of sorting CCG cards, which is further complicated by the fact that you want near-instant access to any given card *and* are constantly taking cards *out* of order.

(no subject)

Date: 2013-03-20 08:06 pm (UTC)
From: [identity profile] be-well-lowell.livejournal.com
What is the ordering relation?

(no subject)

Date: 2013-03-20 09:54 pm (UTC)
From: [identity profile] be-well-lowell.livejournal.com
They all have bar codes, right? And you can map those into a reasonable set of metadata, I would guess?

These days I might not bother sorting the physical objects directly...

(no subject)

Date: 2013-03-20 11:00 pm (UTC)
laurion: (Default)
From: [personal profile] laurion
Ah, but the physical sorting is important for later retrieval of physical objects. You don't want to have to scan 15000 barcodes to find the one comic you are looking for.

Digital materials? We have computers to do the searching for us, and so content or metadata searches make sorting a far more ignorable task.

But when your optimal storage is in large densities (comic book boxes hold many, many comics) you still can't just use digital records to find things, because you can only get down to a certain level of granularity. 'Box 32, last third' might be about it.

(no subject)

Date: 2013-03-21 12:43 am (UTC)
From: [identity profile] be-well-lowell.livejournal.com
Not necessarily: if you scan the books in order as you put them into one box after another, the database will know exactly how far into which box a given bar code is. The physical location should be *part* of the metadata.
Edited Date: 2013-03-21 12:45 am (UTC)

(no subject)

Date: 2013-03-21 12:09 am (UTC)
From: [identity profile] be-well-lowell.livejournal.com
Physical location is just another element in the metadata. You scan each book as you put it into box y, and "x from the front in box y" is part of the record for that book. This requires a little discipline to maintain, but not really a whole lot: if you take a book out (to, you know, read, for example), all you have to do is re-scan it before putting it away (and you don't even have to put it back where it came from). Duplicates cause problems, but handling that is a minor additional feature.

(no subject)

Date: 2013-03-21 12:15 am (UTC)
From: [identity profile] be-well-lowell.livejournal.com
How much of your collection is from before bar codes? This is complicated, because for a while (I'm not sure how long, but it was widely true in 1992 when I moved to Massachusetts and resumed the student life for a couple of years) comic book shops got issues without bar codes, and other sellers got them weeks later, identical except for having bar codes on the cover.

If a large portion of the collection doesn't have bar codes, then I doubt the automation suggestions will help much. Maybe if Google Goggles is a lot smarter than I think it is, but that's a long shot.

(no subject)

Date: 2013-03-21 12:20 am (UTC)
From: [identity profile] be-well-lowell.livejournal.com
Finally: um, bar code scanners? You mean you don't have a smartphone with a camera?

[I am highly impressed by the "Book Catalogue" app that I use on my Android devices. It scans barcodes to add books, retrieves information from various sources including Amazon and LibraryThing, allows me to physically notate location (by hand, so I don't do that), and places entries on virtual bookshelves.]

(no subject)

Date: 2013-03-21 06:28 am (UTC)
From: [identity profile] elektra.livejournal.com
Smartphone camera is waaay too slow for database lookup. My cuecat interfaces directly with my library program (for books, music or dvds).

(no subject)

Date: 2013-03-20 08:23 pm (UTC)
From: [identity profile] dragonazure.livejournal.com
100+ longboxes?

Dude your protestations to the contrary, like it or not, you ARE a collector.

...or at least a comic book hoarder. :)
Edited Date: 2013-03-20 08:27 pm (UTC)

(no subject)

Date: 2013-03-20 09:34 pm (UTC)
From: [identity profile] goldsquare.livejournal.com
A reader reads and discards or keeps a very select few.

You hoard comics.

I hoard SF books.

Let's face facts. Me, I avoid facts - that's why I still have the damned books. :-)

(no subject)

Date: 2013-03-20 10:45 pm (UTC)
From: [identity profile] serakit.livejournal.com
*waves* Another book hoarder! I just hoard straight-up *books*, with no qualifiers.

(no subject)

Date: 2013-03-20 11:02 pm (UTC)
laurion: (Default)
From: [personal profile] laurion
I'm not convinced there is geek cred in having the most comic books. There is geek cred in having the best curated collection, probably. Or in having the broadest or deepest knowledge or experience in comic books. Sheer numbers do not impress me any more though.

(no subject)

Date: 2013-03-22 06:24 pm (UTC)
From: [identity profile] ladymacgregor.livejournal.com
. . . what did DC do?

(says the very occasional comic-book reader. As a matter of fact, I think that any comics I HAVE read, I borrowed from you.)

As a non-trained programmer...

Date: 2013-03-20 08:29 pm (UTC)
From: [identity profile] andrea habura (from livejournal.com)
a sort of "hedge wizard" coder, if you will: thank you *very* much for that explanation of sorting algorithms. I knew this trick, just not its programming application.

Not only did you teach me something, I now more fully appreciate Monday's xkcd cartoon.
Edited Date: 2013-03-20 08:30 pm (UTC)

Re: As a non-trained programmer...

Date: 2013-03-20 09:49 pm (UTC)
From: [identity profile] be-well-lowell.livejournal.com
I actually grabbed Justin's copy of Aho/Hopcroft/Ullman ("Data Structures and Algorithms") on one of his shelf-clearing adventures a decade of so back. I didn't have a decent book on the subject, not having ever gotten a CS degree myself. I don't remember ever using it.

Re: As a non-trained programmer...

Date: 2013-03-20 11:53 pm (UTC)
From: [identity profile] be-well-lowell.livejournal.com
I did take a data structures class in graduate school, but I had several years of professional experience by then. Like most of the other people in the class. For most of us, it was a semester-long experience of discovering that we actually knew the material without having previously realized it. But the textbook was terrible, which is why I'm happy to have your copy of Aho on my shelf -- whereas you high-level programmers depend on system routines to get the platform details right, I am sometimes the one having to write the system routine.

(no subject)

Date: 2013-03-21 01:09 pm (UTC)
From: [identity profile] dragonazure.livejournal.com
There seems to have been two schools of thought about how much to write about data structures when I was going through my courses. One book I had was a rather simple, straightforward book, and the other was a ponderous tome. I kept both. The tome makes a nice decorative element to our bookshelves, but the plastic-coated one that had been repeatedly resold (at higher and higher prices at each resale) was the more useful of the two....

Some course professor must have been impressed by the weight of the material instead of its utility.

(no subject)

Date: 2013-03-20 09:45 pm (UTC)
From: [identity profile] goldsquare.livejournal.com
Beware... old personal story, but the moral is ALSO beware.

Back when I was in college, I got a summer job in the office of a car dealership in the South Bronx. The first job I had was to review and sort lists of VIN's in stock, in order to compare them against the list provided by the manufacturer. This task was done quarterly, and was much disliked by the regular staff.

They gave it to me.

Fresh from a computer algorithms class, I built a plan, and assaulted the problem. Lunch was at noon: I stayed until 12:15 to finish the job. I left for lunch.

I returned at 1. To discover that I had been "fired" from the office job. It seems that it had taken the regular staff at least a week to do it. At first they had doubted my results. When my results were correct, they insisted I be removed from the office for "making them look bad".

My new job in the dealership? They had been collecting physical wheel well moldings for about 50 years, and they were somewhat randomly emplaced in the attic of the dealership's parts department. I was given a sawed off baseball bat for the rats (and there were rats) and sent up there in the summer heat to sort the moldings.

Slow learner me: I sorted the moldings, in about a week. (Merge sort works for ungainly physical objects too.)

Parts people, who had always refused the job because it was too hard and would take too long, had me fired: I made them look bad.

I was demoted, again, to the garage. Where it became my job to sweep the floors and wash the cars. There was no way to make people look bad doing that - although I did optimize the pattern for sweeping - but since it took all day anyway, I just did a better job than the other guys had.

Still: they weren't happy with that. I made them look bad...

Repeat until they found a job so impossible, that I could not do it. Then they insisted I do it: we had just rented a 2 acre field filled with weeds. In order to park cars there, I had to clear that field. With a dull machete. I sharpened the machete, and bought some gloves. But within a few hours my hands were too blistered to work.

The owner insisted I continue. I went out there, worked until my hands were bloody, and came into his office. I removed my gloves, left two bloody handprints on the old-fashioned desk blotter, and said "I quit".

Some things you can't optimize away.

(There were some consequences for the owner, however...)

(no subject)

Date: 2013-03-20 11:05 pm (UTC)
laurion: (Default)
From: [personal profile] laurion
Wow. It's easy to say from distance and perspective, but I don't know if I could have stuck around that facility as long as you did. Driving away efficient workers seems like a great way to ruin a business. They should have had you teach the techniques at every step of the way to improve everyone's efficiency.
Edited Date: 2013-03-20 11:06 pm (UTC)

(no subject)

Date: 2013-03-20 10:18 pm (UTC)
From: [identity profile] be-well-lowell.livejournal.com
One other point is that O-notation only refers to the number of comparisons required (with the typical and worst cases sometimes being different polynomials). It's really easy on modern systems to end up with memory fetches dominating the runtime. Both quicksort and merge sort can maintain locality of reference, but naive implementations will probably thrash the caches.

(no subject)

Date: 2013-03-20 11:24 pm (UTC)
laurion: (Default)
From: [personal profile] laurion
For those who find themselves enjoying the ideas of algorithms without the specifics of programming languages, you might want to check out a couple of books:

Computational Fairy Tales - http://is.gd/yWf7Av
Best Practices of Spell Design - http://is.gd/ALQEM4

(no subject)

Date: 2013-03-21 06:35 am (UTC)
From: [identity profile] elektra.livejournal.com
I miss Outer Limits. There is no comic book shop around here (Delaware, where we relocated).

(no subject)

Date: 2013-03-21 01:19 pm (UTC)
From: [identity profile] meranthi.livejournal.com
I'm relocating to Kentucky later this year. I'm not sure how I'm going to survive without my geek mainstays.

*Sniff*

Date: 2013-03-23 02:55 pm (UTC)
From: [identity profile] fairdice.livejournal.com
In, I think, 1992, we were over at your house and Jane had a large bunch of index cards that needed sorting. She figured that she would basically do a merge sort, maybe getting help with the individual shards, but was grumbling about the work of the merge that she'd have to do herself at the end.

I pointed out that she could do a bucket sort: Deal out all the cards into piles "starts with A", "starts with B", etc; get each pile sorted; and then at the end just stack all the piles on top of each other. She was gob-smacked, and told me "You've changed the way I will sort things for the rest of my life."

I hadn't thought of that day in years... :-)

(no subject)

Date: 2013-03-24 03:31 am (UTC)
From: [identity profile] eclecticmagpie.livejournal.com
I haven't read *all* the comments, but I think I got through enough of them to still think I have something to say.

If the ultimate goal is to sell them,

a) don't do any more sorting unless you enjoy it more than anything else you could do with that time.

b) when you're ready to sell, put the comics up one box at a time, *indexing* but not necessarily sorting each box as you do so (though you can sort within a box while listing them if you want to). Include markers to make it easier to find a section within a box.

c) repeat until done.

All remaining sorting takes place on the computer; fetching requires referring to multiple indices -- but you don't have to fetch something until it's actually going to sell, so you have a reward for the work of fetching it.

(no subject)

Date: 2013-03-24 03:44 pm (UTC)
From: [identity profile] eclecticmagpie.livejournal.com
Ah. I had thought that they were ALL already in the "definitely sell" bucket. My suggestion was based on that premise. Also, I did NOT think that they would be easier to sell in runs, but you're probably right that they will be. Finally, part of my thinking was that you would actually be starting to sell things as soon as the first box was listed, and then could get around to listing the second box when you bloody well felt like it -- which still strikes me as a good idea.

So, altering my suggestion to face reality, you could do your final sort on one box (dividing it into three buckets in the process, start listing the appropriate bucket (fully sorted) immediately, and then merge other boxes into the three buckets gradually at your leisure. Some runs would get broken up in the process, but you'll still have the incenticization of seeing sales start happening -- or, if you don't see any sales, you might decide that it's a bad plan and move in a different direction.

I agree with Golden Square, you're acting like a hoarder; having decided to get rid of the comics, you've chosen the slowest possible path to actually doing so (this is the pot calling the kettle black, of course) -- could I interest you in some old-but-not-antique tomes on medieval philosophy?

Profile

jducoeur: (Default)
jducoeur

October 2025

S M T W T F S
   12 34
567891011
12131415161718
19202122232425
262728293031 

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags