Wednesday, September 10, 2008

New Control Structure Considered Useful

We have a large data processing problem at work. Basically, we get thousands of items every day and have to figure out which ones go together. The only way to know if they go together is to try them. Some items may not go with anything, some may fit with a hundred or more other items. The one nice thing is that once we've found 3-5 items that go together, we can zip through all the other members of the group.

The problem is finding those 3-5 items. Choosing every possible combination of 5 would take too long. Choosing every combination of 3 is fast enough, but leaves a lot of extras. The simple solution is to first try all the combos of 3, then try all the combos of 4 on the remainder, then try all the combos of 5 on the remainder of that. The smaller and smaller pile makes the larger and larger choices feasible.

My boss, who is a competent practical programmer, suggests we have 3 procedures: One that does a 3 nested loop, one that does a 4, and one that does a 5. Like this (in Tcl):

set items {apple orange banana grape strawberry}
set count 5

for {set i 0} {$i < $count} {incr i} {
    for {set j [expr $i + 1]} {$j < $count} {incr j} {
       for {set k [expr $j + 1]} {$k < $count} {incr k} {
         set string [lindex $items $i]
         lappend string [lindex $items $j]
         lappend string [lindex $items $k]
         puts $string
       }
    }
}

That's just the 3 level loop because the other two look almost the same. The 4 and 5 level loops are identical except for having additional levels. As a programmer who values elegance over readability, the phrase "identical except for" is a red flag. Why have 3 separate procedures when all you are adjusting is a single parameter? What I need is a new control structure that is basically a for loop but lets me control how many nested fors there are.

Tcl makes it easy to create new control structures. Here's the control structure definition:

# Usage:
#  indexcount - how many items are being chosen
#  itemcount  - how many items are being chosen from
#  indexvar   - variable to hold current combination of indexes
#  body       - code to execute for each combination
proc chooseloop {indexcount itemcount indexvar body} {

    if {$indexcount > $itemcount} { 
      error "More indexes than items"
    }

    if {$indexcount == 0} { return }

    set indexes {}
    for {set i 0} {$i < $indexcount} {incr i} {
       lappend indexes $i
    }

    set maxindexval [expr $itemcount - 1]

    while {1} {
 
       # make new body that sets indexvar first
       set newbody "set $indexvar {$indexes}\n$body"

       # do this iteration
       uplevel 1 [list eval $newbody]

       # find incrementable index
       set found no
       for {set i 0} {$i < $indexcount} {incr i} {
          set index [lindex $indexes end-$i]
          set thismax [expr $maxindexval - $i]
          if {$index < $thismax} {
             set found yes
             break
          }
       }

       if {!$found} { break }

       # increment this index and set all following ones
       set incrindex [expr ($indexcount - 1) - $i]
       set precedingval [lindex $indexes $incrindex]
       for {set j $incrindex} {$j < $indexcount} {incr j} {
          lset indexes $j [expr $precedingval + 1]
          set precedingval [lindex $indexes $j]
       }
    }
}

Now to get my 3 level loop, I can call like this:

chooseloop 3 5 indexes {
    set str ""
    foreach index $indexes {
       append str "[lindex $items $index] "
    }
    puts $str
}

In order to do 3, then 4, then 5, I can call like this:

for {set i 3} {$i < 6} {incr i} {
    chooseloop $i 5 indexes {
       set str ""
       foreach index $indexes {
          append str "[lindex $items $index] "
       }
       puts $str
    }
    puts "--"
}

The output of that last one is:

apple orange banana 
apple orange grape 
apple orange strawberry 
apple banana grape 
apple banana strawberry 
apple grape strawberry 
orange banana grape 
orange banana strawberry 
orange grape strawberry 
banana grape strawberry 
--
apple orange banana grape 
apple orange banana strawberry 
apple orange grape strawberry 
apple banana grape strawberry 
orange banana grape strawberry 
--
apple orange banana grape strawberry 
--

3 comments:

  1. I thought you'd like that.

    And the great news is that for a given nesting level it only takes 10 times longer than the nested loop version! There's nothing Tim loves more than putting elegance over speed, so I expect a big raise.

    ReplyDelete
  2. I made this program mainly as a joke, but actually now I'm a little puzzled: Why exactly is it so slow? If I put the exprs in {}, I shave off about 10-20%, I can see that.

    The main problem seems to be looping through the indexes in the caller's body. If I could somehow set some named variables, I think it would be nearly as fast as the nested version. But of course I'd have to have a constant-size list of named variables, which doesn't help me.

    ReplyDelete