Patent for Sale:

Minimum Edit Distance Algorithm    

Finds the minimum edit distance / longest common subsequence for sequences over small alphabets, such as DNA sequences or byte streams

Overview

This algorithm applies A* search with a proprietary counting heuristic to find the minimum set of changes -- insertions and deletions -- required to convert one given sequence into another. The key advantage appears with small alphabets, such as comparing DNA sequences or comparing two binary files byte by byte. Its running time is more nearly linear than that of other algorithms such as Hunt-Szymanski (used in UNIX diff).
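To make the search structure concrete, the sketch below shows A* over pairs of string prefixes. The patented counting heuristic is not public and is not reproduced here; the stand-in heuristic (the difference in remaining prefix lengths) is only a simple admissible lower bound chosen for illustration, and all names in the sketch are hypothetical.

import heapq

def edit_distance_astar(a: str, b: str) -> int:
    """Minimum number of insertions and deletions to turn a into b."""
    def h(i: int, j: int) -> int:
        # Stand-in heuristic (NOT the patented counting heuristic): the
        # difference in remaining lengths is a lower bound on the remaining
        # insertion/deletion cost.
        return abs(i - j)

    start = (len(a), len(b))        # a state is (length of A prefix, length of B prefix)
    best = {start: 0}
    frontier = [(h(*start), start)]
    while frontier:
        f, (i, j) = heapq.heappop(frontier)
        g = best[(i, j)]
        if f > g + h(i, j):
            continue                # stale queue entry; a cheaper path was found earlier
        if i == 0 and j == 0:
            return g                # both prefixes empty: minimum edit distance found
        moves = []
        if i and j and a[i - 1] == b[j - 1]:
            moves.append(((i - 1, j - 1), 0))   # matching last characters: strip both, cost 0
        if i:
            moves.append(((i - 1, j), 1))       # strip last character of A's prefix, cost 1
        if j:
            moves.append(((i, j - 1), 1))       # strip last character of B's prefix, cost 1
        for nxt, cost in moves:
            ng = g + cost
            if ng < best.get(nxt, float("inf")):
                best[nxt] = ng
                heapq.heappush(frontier, (ng + h(*nxt), nxt))

print(edit_distance_astar("cat", "bat"))   # prints 2 (delete "c", insert "b")

The claimed near-linear behavior comes from the stronger patented heuristic pruning most of the state space; the stand-in above only demonstrates the mechanics of the search.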

Primary Application of the Technology

Delta compression -- for example, sending only the differences between two versions of a web site
Virus scanning -- find the difference between a given e-mail or upload and known viruses
DNA sequence comparison -- compare two DNA sequences to find the minimum differences between them
Binary-file comparison -- compare two binary files and find the minimum differences

The Problem Solved by the Technology

The minimum edit distance (equivalently, longest common subsequence) problem.
It makes it practical to compare DNA sequences and binary files byte by byte, and to find the minimum differences, in a reasonable amount of time.

How the Technology Solves the Problem

The algorithm applies A* search to the graph formed by all truncations (prefixes) of strings A and B. For instance, "cat" versus "bat" gives the sixteen-state space ("cat","bat");("ca","bat");("c","bat");("","bat");("cat","ba");("ca","ba");("c","ba");("","ba");("cat","b");("ca","b");("c","b");("","b");("cat","");("ca","");("c","");("",""). An edge has a cost of "0" if it strips identical last characters from both elements of a pair, and a cost of "1" if it strips the last character from only one of the elements. This space is searched using the patented counting heuristic.
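As a concrete illustration of this state space, the short sketch below (an illustration, not taken from the patent) enumerates every truncation pair for "cat" versus "bat" and lists the outgoing edges with the costs described above.

A, B = "cat", "bat"

# All truncation pairs (prefixes of A crossed with prefixes of B): 16 states.
states = [(A[:i], B[:j]) for i in range(len(A), -1, -1)
                         for j in range(len(B), -1, -1)]

def edges(x, y):
    """Outgoing edges from state (x, y) as (next_state, cost) pairs."""
    if x and y and x[-1] == y[-1]:
        yield (x[:-1], y[:-1]), 0   # identical last characters: strip both, cost 0
    if x:
        yield (x[:-1], y), 1        # strip the last character of x only, cost 1
    if y:
        yield (x, y[:-1]), 1        # strip the last character of y only, cost 1

for s in states:
    for t, cost in edges(*s):
        print(s, "->", t, "cost", cost)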

Competitive Advantage

For sequences over small alphabets, such as DNA sequences or individual bytes, it runs in an amount of time that is more nearly linear than quadratic. This makes it practical to execute such comparisons in a reasonable amount of time. For example, comparing two 1-megabyte binary files might require on the order of 1 million operations with this algorithm, versus on the order of 1 trillion operations with more traditional quadratic techniques.

Patent Summary

U.S. Patent Classes & Classifications Covered in this listing:

Class 707: Data Processing: Database and File Management or Data Structures

This is the generic class for data processing apparatus and corresponding methods for the retrieval of data stored in a database or as computer files. It provides for data processing means or steps for generic data, file and directory upkeeping, file naming, and file and database maintenance including integrity consideration, recovery, and versioning. There are three main divisions: 1. database and file accessing; 2. database schema and data structure; 3. file and database maintenance.