Sorting, Indexing, Queries, and Data Compression

Why people sort, and how sorts are measured.

People sort for many reasons. Surprisingly, the way sorts are traditionally measured does not measure the total application time. The traditional sort measure is the elapsed time to read the input and write it in order. However, that does not include the time to run a subsequent application against the sorted data.

Sorting is never done in a vacuum, but is always combined with subsequent application(s). The total time is the time to sort plus the time to run the application(s) against the sorted output. The sort reads the data at least once, writes it at least once, and then the application reads the data again. Thus the total job transfers the data from disk to memory or back three times.

If the application can process the output of the sort as the sort produces it, then there is only one data transfer, not three. If the machine is fast enough, and there is enough memory to hold all of the input, then the elapsed time is equal to the time to read the input, and is most important. If the CPU time is longer than the I/O time, then the actual CPU time is most important.

If the input data is larger than memory, then the sort may copy it in and out of memory multiple times. Some sorts, like the one distributed with a popular operating system, even start by copying the entire file to a temporary file first, adding another complete copy operation to the time.

Some reasons for sorting are to prepare reports, to process updates against a master file, or a database, or to create an index for queries against a database.

How Skyelogic methods address the total application.

The Skyelogic collection of sorting and indexing methods, loosely called shortsort, improve the performance of the total job. Here are some of the things shortsort does to reduce the time and to make it easier for the user to use the results.

  • Reduces the CPU time by better algorithms. Many files only need to be created in memory, sorted, and then used, without any I/O at all. Benchmarks on in memory data show shortsort to be more than 5 times faster than quicksort, and takes the same amount of time, for any input order.

  • Stores up to 5 or 10 times more data in memory.

  • Reads and writes compressed files, either shortsort compressed alone, or in combination with gzip. Using gzip in combination results in a compression ratio twice that of using gzip alone. Compression ratios are from 10 to 1 to 36 to 1. Application programs can also read or write compressed files in any combination, with library functions.

  • Allows selecting the compression ratio depending on the CPU time cost, if it is faster to process the file with less efficient compression.

  • Allows the application to process the compressed output without having to first write the uncompressed data to disk.

  • Builds indexes on the compressed file, which can be used for queries on the compressed file, without first having to uncompress the file. This is especially important for archived data, which may be hundreds of gigabytes. You can access records starting in the middle of the compressed file.

  • Builds compressed indexes against the compressed data, which can be used for queries. There are multiple ways of storing the compressed indexes, depending on the CPU to I/O time ratio.

  • Searches compressed indexes in logarithmic time, without having to first uncompress them, which greatly improves the time for queries.

  • Retrieves sorted, compressed data in order by any subset of columns, by using stored index information that it uses to reorder the data without uncompressing it. Typically on a 1GHZ Pentium III, the reordering step can reorder a million records in memory with 17 variable length string sort keys in a few seconds.

  • Preserves the original order of equals. This allows sorting by additional columns, and retaining the order from prior sorts.

  • Uses compressed indexes on text files, for queries such as "all records having all of these words and none of those words", or for table file queries e.g. find all records where column x matches the pattern *.cgi, where * means any string. Combines pattern matching with the regular query predicates, like <, >, eq, ne, prefixof, suffixof, infixof, etc.. .

  • Parses text data via a user syntax table. Instantiates names as a result that it uses to construct fields, to validate data, to convert data, as from date to number and back, or to call any C++ application program and pass it the parsed fields.

  • Has a pattern matching function that runs in a time proportional to the number of matches found, no matter how many patterns there are. For example, to process a 100 megabyte file on a 1GHZ P-III against a table of 2,000 patterns takes 30 seconds. Currently it can store about 1 to 5 million pattern bytes in 1GB of memory. Work in progress will result in a 10 to 1 improvement in memory use, or 10 to 50 million pattern bytes per GB of memory.
  • Skyelogic Application Programming Interface (API).

    Most of the above items have been implemented, tested, and run against many different files. There have been several versions of the shortsort sorts, and work continues until having all of the sorting, indexing, query functions, parseing and pattern matching available in a single interpreter.

    In a sense it will never be finished, because there will be new logic, but there is a C++ library of all of the underlying logic to do all of the above things. This web site will present an API, and example programs that use the API.

    Skyelogic can also supply custom programs, that use the existing tools, in a shorter time than usual, because the hard parts are already done, and very likely run much faster than you can do by yourself.

    Over the course of decades, Skyelogic has continued research and development of the collection of functions, and provided dozens of companies specialized applications using them. Some have based an entire business on the the applications' performance.

    There are currently hundreds of programs and macros. At this point, this site is completely dependent on contributions for support. If you care about any of the above enough to send us a contribution, please tell us what you are most interested in, and we will document them first.

    It is our intention to make the compiled libraries for various unix and windows platforms, and the header files available free. Remember, free doesn't mean support. If you have a problem using it, and send email about it, there is no guarantee that you will get a response, because time is limited. However, if you send a contribution, and then ask for help, we will try to help you. If you would like other kind of support, such as debugging help, instruction, examples of use, and a maintenance subscription, or if you have a need for source code, contact Skyelogic.

    Send contributions payable to Skyelogic, PO box 1847 SRB, Poughkeepsie NY 12601, and tell us what you are most interested in. Your contributions can help speed up the documentation.

    If you have are planning a programming project, feel free to ask for estimates for specialized applications.

    Send email to support at skyelogic.com

    Preview of Coming Topics.

    C++ Container, List, or Sequence class:
  • All of the following are ideas that have been used for decades by many people. The only new thing is organizing it under one class.
  • An element can be any string, number, or fixed size object of any type.
  • Map element to index, index to element.
  • Enumerate in lexicographic or numeric order.
  • Add new elements on the end without incurring the overhead of mallocs nor chained lists.
  • Never needs to allocate large contiguous areas of memory, but allocates in blocks.
  • Constant access time to every element, equal to a small number of machine instructions.
  • Free the entire container without any recursion.
  • Allow accessing elements via 32 or 16 bit indices, instead of with 64 bit pointers, thus saving half the memory for references in 64 bit mode.
  • Can store element indices instead of element values to refer to values.
  • Each unique value in a set is only stored once.
  • C++ Table class:
  • Allows row and column access to tables, stored either in memory or on disk, or partially paged in.
  • Elements are stored in the smallest appropriate representation for each block. Choosing the smallest representation can be different across blocks. If a block has only zeros and ones, then it can be stored as bits. If it only has number from 0 to 255, it can be stored as bytes, etc.. Selecting the tradeoff is dependent on the speed to space ratio, which can be specified.