Updated Common Prefix Notation Transformation

This is an update to my previous article on determining the common prefixes of a set of strings (passed a sequence of lines) and printing them in common prefix notation (CPN). The problem of reconstructing strings from the original input that are complete prefixes of other input strings is addressed.

The code is now in a Git repository:

Gitea project trieLast 3 commits: by Tilman Kranz: 69fc4506 add test, escape spaces in output by Tilman Kranz: 0ae62285 include formal definition by Tilman Kranz: 6099f6d2 Initial commitThis project has currently no releases.

The Problem

The previous version of trie.py had the following problem:

The following example renders the wrong output, because a trie does not actually store the common prefix (this is what a suffix tree would accomplish), it just stores different prefix paths, and a, ab and abc have no different prefix paths, it’s just a degenerate tree a → b → c:

./trie1.py << EOF
a
ab
abc
EOF

Consider construct_trie(), which builds the Trie data structure from a series of strings:

def construct_trie(word_list):
    root = dict()
    for word in word_list:
        level = root
        for letter in word:
            if letter not in level:
                level[letter] = dict()
            level = level[letter]

    return root

The algorithm initializes an empty associative array (a Python „dict“), which it calls „root“. It iterates a series of strings, which is calls „words“. In each string, it iterates the characters, which it calls „letters“. For the first letter, if it s not present as a key in the dictionary, it creates this key with an empty dictionary as value. The letter-loop takes the dictionary that is the value of the current letter key as new root and proceeds to the next letter. This way, the algorithm builds a data structure of nested dictionaries that contains all letters in all words and allows reconstruction of the order in which they appeared.

What is missing in this algorithm is the information if a letter is the last letter in the word. If this information was present, it was possible to not only reconstruct the series of letters in the words in order of appearance but also determine the word boundaries to allow for reconstruction of the individual words.

Consider the failed example above, the incorrect output was:

abc

This output reads: „a sequence of letters was seen, a → b → c.“

The correct output would have been:

a{,b{,c}}

This output reads: „letter a was seen as a standalone word but also was seen followed by a letter b, which terminated a standalone word but also was seen followed by a letter c.“

Implementation Change

The following change to construct_trie() modifies the letter-loop to determine if the current letter is the last letter in the current word. If true, it inserts an empty string as key with an empty dictionary as value into the dictionary for this letter. This empty entry in the letter dictionary stores the information that this letter has been a word boundary by making the empty string one possible suffix of the prefix rendered by the sequence of letters up to the current letter.

Here is a modified version of construct_trie() that accomplishes this:

def construct_trie(word_list):
    root = dict()
    for word in word_list:
        level = root
        l = len(word)
        last = l - 1

        for i in range(l):
            letter = word[i]
            if letter not in level:
                level[letter] = dict()
            if i == last:
                level[letter][''] = dict()
            level = level[letter]

    return root

I have published the updated program in a Git repository: https://tk-sls.de/git/tilman/trie

Changed Formal Definition

Any two strings s1, s2 have a common prefix cp, which is the string of characters that s1 and s2 have in common up from the start. If s1, s2 have no such common characters, cp is the empty string. Let 1, 2 be the remainders or suffixes of s1 and s2 if cp is removed from them. If s1 is a prefix of s2 then 1 is the empty string, and if s2 is a prefix of s1 then 2 is the empty string.

Given the definitions above, the common prefix notation (CPN) of s1, s2 – in short, cpn(s1,s2) – shall be:

Case 1.1: If neither 1 nor 2 is empty:

cpn(s1,s2) := cp{1,2}

Case 1.2: If 1 is empty and 2 is not empty:

cpn(s1,s2) := cp{,2}

Case 1.3: If 1 is not empty and 2 is empty:

cpn(s1,s2) := cp{1,}

Case 1.4: If both 1 and 2 are empty, then s1 and s2 are equal:

cpn(s,s) := s

Given the definition above, for any given CPN of a set of strings s1, s2, …, sn, the CPN of the set extended by an additional string sn+1 is defined as follows:

Case 2.1: If the suffix of sn+1 with cp removed, n+1, has no nonempty common prefix with any of the suffixes 1, 2, …, n:

cpn(s1,s2,…,sn+1) := cp{1,2,…,n+1}

Case 2.2: Otherwise, since the first characters of 1, 2, …, n are distinct, there can only be one element m that has a nonempty common prefix with n+1. Given m to be that element:

cpn(s1,s2,…,sn+1) := cp{1,2,…,cpn(m,n+1),…,n}

Examples:

cpn("a", "b") = "{a,b}"
cpn("aa", "ab") = "a{a,b}"
cpn("aa", "ab", "abc", "abd") = "a{a,b{,c,d}}"
cpn("a", "ab", "abc") = "{a{,b{,c}}}"

Note: For the sake of reasoning, strings containing the CPN’s reserved characters {, } and , shall be considered invalid input. In a practical implementation, a syntax for „escaping“ these reserved characters should be available. For example, in the reference implementation, the print_trie() function inserts a backslash character \ in front of every literal {, , and }.