Suffix Array Construction: Efficient String Indexing Explained with Examples

The suffix array is a fundamental data structure in string algorithms, used for efficient text processing and pattern searching. It is a compact alternative to the suffix tree, with a simpler implementation and lower memory usage. Understanding suffix array construction is useful for computer science learners, competitive programmers, and developers working with large-scale string data.

What is a Suffix Array?

A suffix array is an array of integers that represents the starting positions of all suffixes of a string arranged in lexicographic (dictionary) order. It enables fast searches of substrings and efficient text manipulations, laying the foundation for problems like substring search, longest repeated substring, and bioinformatics sequence analysis.

Example

Let’s take the string: S = "banana"

All suffixes of “banana” are:

0: banana
1: anana
2: nana
3: ana
4: na
5: a

Sorted lexicographically (dictionary order):

5: a
3: ana
1: anana
0: banana
4: na
2: nana

Thus, the suffix array is:

[5, 3, 1, 0, 4, 2]

Applications of Suffix Array

Substring Search: Binary search over the suffix array finds a pattern of length m in O(m log n) time.
Longest Common Prefix (LCP): Paired with the suffix array, the LCP array exposes repeated substrings.
Data Compression: Basis for algorithms like Burrows–Wheeler transform in compression technologies.
Bioinformatics: DNA sequence analysis for finding gene matches.
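
As a minimal sketch of the substring search item above, the following runs two binary searches over a suffix array to find every occurrence of a pattern. The names `naive_sa` and `find_occurrences` are illustrative and do not come from any library, and the suffix array is built with a naive one-liner only to keep the example self-contained.

```python
def naive_sa(s):
    # Illustrative helper: sort suffix start positions by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, pattern):
    """Return the sorted start positions of pattern in s."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern.
    return sorted(sa[start:lo])


text = "banana"
sa = naive_sa(text)
print(find_occurrences(text, sa, "ana"))  # [1, 3]
print(find_occurrences(text, sa, "nan"))  # [2]
print(find_occurrences(text, sa, "xyz"))  # []
```

Each probe compares at most m characters, so a query costs O(m log n) once the suffix array exists.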

Naïve Construction of a Suffix Array

The simplest approach is:

Generate all suffixes of the string.
Sort them in lexicographic order.
Record starting indices in the sorted order.

Python Example


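def build_suffix_array(s):
    # Pair every suffix with its starting index.
    suffixes = [(s[i:], i) for i in range(len(s))]
    # Tuples sort by suffix text first; all suffixes differ, so the index never breaks a tie.
    suffixes.sort()
    return [index for (suffix, index) in suffixes]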

print(build_suffix_array("banana"))

Output: [5, 3, 1, 0, 4, 2]

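Complexity: O(n² log n) time in the worst case, because the sort makes O(n log n) comparisons and each string comparison can take O(n). Storing every suffix as a separate string also uses O(n²) memory.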

Efficient Suffix Array Construction

Naïve construction is too slow for long strings. Efficient methods construct suffix arrays in O(n log n) or even O(n) time.

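Prefix Doubling Algorithm (O(n log n) with radix sort)

The prefix doubling approach sorts suffixes in rounds, comparing ranks of prefixes of lengths 2, 4, 8, etc. Each round reuses the ranks from the previous round, so comparing two prefixes of length 2k reduces to comparing two pairs of integers. There are O(log n) rounds; sorting each round with a radix sort gives O(n log n) overall, while a general comparison sort per round gives O(n log² n).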

Steps:

Assign integer ranks based on first characters.
Sort suffixes by (rank[i], rank[i+k]) for k = 1, 2, 4, …
Repeat until ranks are distinct or k exceeds string length.

Python Implementation (Prefix Doubling)


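def build_suffix_array_efficient(s):
    n = len(s)
    if n == 0:
        return []  # guard: sa[0] below would fail on an empty string
    k = 1
    rank = [ord(c) for c in s]  # initial ranks: character codes
    temp = [0] * n
    sa = list(range(n))

    while True:
        # Sort key: rank of the first k characters, then rank of the next k
        # (-1 if the suffix is too short, so shorter suffixes sort first).
        def key(x):
            return (rank[x], rank[x + k] if x + k < n else -1)

        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank.
        temp[sa[0]] = 0
        for i in range(1, n):
            prev, curr = sa[i - 1], sa[i]
            temp[curr] = temp[prev] + (key(prev) != key(curr))
        rank = temp[:]
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2

    return sa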

print(build_suffix_array_efficient("banana"))

Output: [5, 3, 1, 0, 4, 2]
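
The `sort` call in the implementation above is a comparison sort, so each round costs O(n log n). As a hedged sketch of how a round can be made linear, the version below replaces it with two stable counting-sort passes (a radix sort on the rank pair). The names `counting_sort_by` and `build_suffix_array_radix` are illustrative assumptions, not part of the original code, and the sketch favors clarity over constant-factor speed.

```python
def counting_sort_by(order, keys, num_keys):
    """Stable counting sort of the indices in `order` by keys[index]."""
    count = [0] * (num_keys + 1)
    for idx in order:
        count[keys[idx] + 1] += 1
    for v in range(num_keys):
        count[v + 1] += count[v]  # count[v] becomes the first slot for key v
    result = [0] * len(order)
    for idx in order:
        result[count[keys[idx]]] = idx
        count[keys[idx]] += 1
    return result


def build_suffix_array_radix(s):
    n = len(s)
    if n == 0:
        return []
    # Compress characters to ranks 1..sigma so that 0 can mean "past the end".
    alphabet = {c: r + 1 for r, c in enumerate(sorted(set(s)))}
    rank = [alphabet[c] for c in s]
    sa = list(range(n))
    k = 1
    while True:
        num_keys = max(rank) + 1
        second = [rank[i + k] if i + k < n else 0 for i in range(n)]
        # Radix sort on the pair: least significant component first.
        sa = counting_sort_by(sa, second, num_keys)
        sa = counting_sort_by(sa, rank, num_keys)
        # Re-rank: equal pairs share a rank, ranks start at 1.
        new_rank = [0] * n
        new_rank[sa[0]] = 1
        for i in range(1, n):
            prev, curr = sa[i - 1], sa[i]
            changed = (rank[prev], second[prev]) != (rank[curr], second[curr])
            new_rank[curr] = new_rank[prev] + changed
        rank = new_rank
        if rank[sa[-1]] == n:  # all ranks distinct: fully sorted
            break
        k *= 2
    return sa


print(build_suffix_array_radix("banana"))  # [5, 3, 1, 0, 4, 2]
```

Each round now does a constant number of O(n) passes, and there are O(log n) rounds, which gives the O(n log n) bound.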

Suffix Array with LCP (Longest Common Prefix)

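For many problems, such as substring frequency analysis, the suffix array is paired with the LCP array, which stores the length of the longest common prefix of each pair of adjacent suffixes in the suffix array. For a string of length n it has n-1 entries: entry i compares the suffixes at positions i and i+1 of the suffix array.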

Kasai’s Algorithm for LCP Construction


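def build_lcp(s, sa):
    n = len(s)
    rank = [0] * n      # rank[i] = position of suffix i in sa
    lcp = [0] * (n-1)   # lcp[r] = LCP of suffixes sa[r] and sa[r+1]
    for i in range(n):
        rank[sa[i]] = i
    h = 0
    # Visit suffixes in text order so the match length h can be reused.
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i]-1]  # suffix just before suffix i in sorted order
            while i+h < n and j+h < n and s[i+h] == s[j+h]:
                h += 1
            lcp[rank[i]-1] = h
            # Dropping the first character shortens the match by at most 1.
            if h > 0:
                h -= 1
    return lcp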

s = "banana"
sa = build_suffix_array_efficient(s)
print(build_lcp(s, sa))

Output: [1, 3, 0, 0, 2]
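
To illustrate what the LCP array is used for, here is a small sketch of two classic queries: the longest repeated substring and the number of distinct substrings. The helpers `suffix_array` and `lcp_array` are naive stand-ins so the example runs on its own (they are assumptions for illustration, and the functions shown earlier could be substituted); `longest_repeated_substring` and `count_distinct_substrings` are likewise illustrative names.

```python
def suffix_array(s):
    # Naive stand-in for the construction functions, kept short for brevity.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Naive stand-in for Kasai's algorithm: compare neighbors directly.
    def common(i, j):
        h = 0
        while i + h < len(s) and j + h < len(s) and s[i + h] == s[j + h]:
            h += 1
        return h
    return [common(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]


def longest_repeated_substring(s):
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    if not lcp or max(lcp) == 0:
        return ""  # no substring occurs twice
    # The largest LCP entry marks two adjacent suffixes sharing the longest prefix.
    best = max(range(len(lcp)), key=lcp.__getitem__)
    return s[sa[best]:sa[best] + lcp[best]]


def count_distinct_substrings(s):
    n = len(s)
    sa = suffix_array(s)
    # Each suffix contributes its length minus the prefix it already shares
    # with the previous suffix in sorted order.
    return n * (n + 1) // 2 - sum(lcp_array(s, sa))


print(longest_repeated_substring("banana"))  # ana
print(count_distinct_substrings("banana"))   # 15
```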

Complexity Analysis

Naïve construction: O(n² log n)
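Prefix Doubling: O(n log n) with a radix sort per round; O(n log² n) with a comparison sort per round, as in the Python implementation above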
Induced Sorting (SA-IS): O(n)
LCP construction: O(n)

Real-World Use Cases

Search Engines: Indexing text for fast lookups.
Data Compression: Used in compressors such as bzip2 (via the Burrows-Wheeler transform).
Bioinformatics: DNA alignment and genome searches.
Plagiarism Detection: Fast repeated substring identification.
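
As one concrete illustration of the compression use case, the Burrows-Wheeler transform can be read directly off a suffix array. The sketch below is a minimal version under stated assumptions: a sentinel character (here `$`) is appended, it must not occur in the input, and it must sort before every other character in the input. The function name `bwt_from_suffix_array` is hypothetical, and this is not how bzip2 itself is implemented.

```python
def bwt_from_suffix_array(s, sentinel="$"):
    """Burrows-Wheeler transform of s via the suffix array of s + sentinel."""
    # Assumption: the sentinel is absent from s and smaller than all its characters.
    if sentinel in s:
        raise ValueError("sentinel must not occur in the input")
    t = s + sentinel
    # Naive suffix array for brevity; any construction method would do.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # The BWT character for each sorted suffix is the character preceding it
    # (for the suffix starting at 0, t[-1] wraps around to the sentinel).
    return "".join(t[i - 1] for i in sa)


print(bwt_from_suffix_array("banana"))  # annb$aa
```

The output groups equal characters into runs, which is what makes the transformed text easier for later compression stages to encode.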

Conclusion

Suffix arrays turn many complex string problems into sorting and binary search over integer indices. They apply to tasks ranging from simple substring search to genome sequence analysis. With construction algorithms ranging from the basic O(n² log n) approach to O(n) methods such as SA-IS, suffix arrays remain an essential tool in algorithmic problem-solving and string processing.