Sorting (Was: BogoSort)

Gabriel Sechan gsechan at hotmail.com
Sun May 12 13:43:36 PDT 2002


>
[snip]
>The average case for quicksort is n*log(n).

"Average case" being randomized data, "average over all permutations"?

>                                         It isn't that complicated a
>sort (some of the optimizations may be, but the basic version isnt too
>bad).

But it's the optimizations that even pretend to make it useful. :)

A little searching for various sorts is interesting. Quicksort takes
a LOT more explanation of what's going on and why it works than most
other sorts (merge, heap, insertion, etc.).


Guess it depends on the person.  It took me a while to understand heap back
in school, I got quicksort right away.  But lets face it, if we're doing
serious work we have a library of sort functions we can reuse by now, I
hope.

    >        The constant on it is much less than merge sort, though.  If 
you
>  have large amounts of data, its usually the best to use.

...and you don't mind it blowing up to O(n^2) at inconveninet times.

"Usual" depends on context. :)


Right.  Insertion is better if its almost sorted.  Quicksort is better if it
isnt.  Quicksort is only bad if the pivot is very high or very low compared
to the median of the array.  WIth middle of 3, thats a very rare case, given
only by unlucky random numbers (and then usually fixed in the next call to
sort the 1st subarray-  1 or 2 of the numbers used to find the pivot are
changed).

    > If the dataset is small, just use bubble, its the easiest to write.

Aaaaaaagh!

Insertion, selection, or comb sort are better than bubble, and just
as easy to write.


Bah.  If Im sorting once, i dont care if it takes 10 ms or 20.  Bubble just
comes to my mind first, so I write it.  I only optimize to one of those if
its being called repeatedly.

    [snip]
>Thats the quicksort weak case.  It can go n^2 on sorted data, as it can
>cause a bad pivot.

In RL, that's _bad_. Sorting partially sorted data is common.


THats where the middle of 3 optimization comes in.  Its a simple opt, but
fixes the almost sorted case to nlogn.


    >                     Of course the middle of 3 optimization (pivot is 
the
>average (or middle, however you want to do it) of the beginning middle,
and
>end values in the array) fixes that.

But that increases the constant multiplier, doesn't it?


Not by much.  The difference on a pentium is about 7 clock cycles per
function call (or loop if you're doing it iteratively).  Probably worth
spending.

    It all comes down to how often you think Murphy will design your input.
I'm paranoid, so I think worst-case is more common than not... but if
you don't think that way, then quick sort is appealing...


With middle of three op fixing the partially sorted case, I think its pretty
safe. I do think its faster, on average, than merge sort.

    [snip]
>Radix can be bad.  Radix is n*b  where n is the number of digits, and b
is
>the time to sort based on 1 digit.  If you have the memory, counting and
>bucket are nice O(n) sorts.

Counting-and-bucket is just a bin sort, right? Radix is a generalized
form of binsort, right?


Counting and bucket are bin sorts.  Radix is slightly different.  Its an
iterative sort-  sort on 1 digit, then sort on the next, then sort on the
third.  So it can be bad.  Its interesting if you want to partially sort the
array (sort on 1 or 2 digits).  That and the fact its a stable sort that is
easily copied by mechanical devices (card readers used radix sort).  It can
be very slow, though-  Ive seen mergesort out preform it.  Its very input
dependant (smaller number of digits in the max value of the array, the
better).  It does, however, work well with bin sorts as it reduces the
storage requirement for counting sort, although it must run them multiple
times.  Total time for a radix with counting is  n*log10(max(array)), with
40 bytes of storage required.  So the question is if
log10(max(array))<log2(n).

    >                              They just aren't done in place.

In theory, in-place sorts are ideal. In practice, I'm told that in-place
sorts are frequently regarded as an academic obsession.


The problem with bin sorts are the memory requirements.  If you use
counting, you need an array of length n, where n is the max value fo the
array (and counting has 4 loops over n, giving it a large multiplier).
Bucket needs an array of linked lists of size n, where n is the size of the
array (or a matrix of size nxn).  It also has an absolute worst time of n^3
(n time to partition, worst case all values in 1 partition.  Then sort the
partition, worst case n^2 for an unlucky insertion sort.  Insertion used
because the average case is a nearly sorted linked list).

So doing a counting sort over an array with a large variance in numbers is
bad (and it wont work on floats).  Doing a bucket has a bad worse case for
closely packed numbers and has high memory requirements for large arrays,
but works well with floats or large value spreads.

It all depends on your memory requirements.  4*(max(array)-min(array)) for
counting, or 12*n bytes for bucket  (Assuming 4 byte ints, 4 byte pointers).
Have a few hundred spare megs, or the time to radix sort?



A question that just occured to me-  I wonder if a function with the
knowledge of all sorts mentioned above would be useful?  I mean a function
that can examine the array, and quickly determine the optimal sort, then
call it.  Anyone tried to do that, or used such a function?


Gabe



_________________________________________________________________
Join the world’s largest e-mail service with MSN Hotmail. 
http://www.hotmail.com




More information about the KPLUG-LPSG mailing list