Transforming Sets of Strings to their Common Prefix Notation

Update Aug 20 2022:

Any two strings s1, s2 have a common prefix cp, which is the string of characters that s1 and s2 have in common up from the start. If s1, s2 have no such common characters, cp is the empty string. Let 1, 2 be the remainders or suffixes of s1 and s2 if cp is removed from them. Then, the common prefix notation (CPN) of s1, s2 – in short, cpn(s1,s2) – shall be:

cpn(s1,s2) := cp{1,2}

Given the definition above, for any given CPN of a set of strings s1, s2, …, sn, the CPN of the set extended by an additional string sn+1 is defined as follows:

Case 1: If the suffix of sn+1 with cp removed, n+1, has no nonempty common prefix with any of the suffixes 1, 2, …, n:

cpn(s1,s2,…,sn+1) := cp{1,2,…,n+1}

Case 2: Otherwise, since the first characters of 1, 2, …, n are distinct, there can only be one element m that has a nonempty common prefix with n+1. Given m to be that element:

cpn(s1,s2,…,sn+1) := cp{1,2,…,cpn(m,n+1),…,n}

Examples:

cpn("a", "b") = "{a,b}"
cpn("aa", "ab") = "a{a,b}"
cpn("aa", "ab", "abc", "abd") = "a{a,b{c,d}}"
cpn("a", "ab", "abc") = "{a{b{c}}}"

Note: For the sake of reasoning, strings containing the CPN’s reserved characters {, } and , shall be considered invalid input. In a practical implementation, a syntax for „escaping“ these reserved characters should be available.

Given a set of strings, one per line, on standard input, the following Python program prints the CPN of these strings on standard output: http://tk-sls.de/ref/trie1.py

Try the examples:

./trie1.py << EOF
a
b
EOF
./trie1.py << EOF
aa
ab
EOF
./trie1.py << EOF
aa
ab
abc
abd
EOF

The following example renders the wrong output, because a trie does not actually store the common prefix (this is what a suffix tree would accomplish), it just stores different prefix paths, and a, ab and abc have no different prefix paths, it’s just a degenerate tree a → b → c:

./trie1.py << EOF
a
ab
abc
EOF

So my guess is, i must implement one of these fancy suffix tree algorithms to get it done right. (see the update instead) 🙂

In the meantime, try this:

find /etc -type d | ./trie1.py

🙂

Imprint RSS