Table of contents
Understanding the KMP Algorithm: Efficient Pattern Matching
The Knuth-Morris-Pratt (KMP) algorithm is a powerful string matching technique that optimizes the search process by avoiding redundant comparisons. It is particularly useful when you need to find all occurrences of a "pattern" string in a "text" string, especially for large inputs.
In this blog, we’ll dive deep into how the KMP algorithm works, walk through its implementation, and explore practical use cases.
The Problem: Naive Pattern Matching
A naive approach to pattern matching involves comparing each substring of the text with the pattern. While simple, it can be inefficient, particularly when repeated comparisons occur after mismatches. For example:
Text: ababababc
Pattern: ababc
After failing at position 3, the naive algorithm restarts comparisons from the next character, even though we already know part of the substring matches.
This redundancy is what KMP avoids.
How the KMP Algorithm Solves This
The KMP algorithm eliminates redundant comparisons by precomputing a Longest Prefix Suffix (LPS) table. This table determines how far to "jump" in the pattern when a mismatch occurs.
The LPS Table: Key to Efficiency
The Longest Prefix Suffix (LPS) table tells us:
- The length of the longest prefix of the pattern that is also a suffix for each substring of the pattern.
For example, consider the pattern "abcabcd"
:
Index ii | Substring | Longest Prefix Suffix | LPS Value |
0 | a | None | 0 |
1 | ab | None | 0 |
2 | abc | None | 0 |
3 | abca | a | 1 |
4 | abcab | ab | 2 |
5 | abcabc | abc | 3 |
6 | abcabcd | None | 0 |
The LPS table for this pattern is: [0, 0, 0, 1, 2, 3, 0]
.
How KMP Works
Preprocessing Phase:
- Build the LPS array for the pattern.
Matching Phase:
Compare the text and pattern.
On a mismatch, use the LPS array to determine how far to shift the pattern without rechecking already matched characters.
KMP Algorithm: Java Implementation
Here’s how you can implement the KMP algorithm:
Step 1: Build the LPS Table
private static int[] computeLPS(String pattern) {
int m = pattern.length();
int[] lps = new int[m];
int length = 0; // Longest prefix suffix
int i = 1;
lps[0] = 0; // First element is always 0
while (i < m) {
if (pattern.charAt(i) == pattern.charAt(length)) {
length++;
lps[i] = length;
i++;
} else {
if (length != 0) {
length = lps[length - 1];
} else {
lps[i] = 0;
i++;
}
}
}
return lps;
}
Step 2: Search for the Pattern
public static void KMPSearch(String text, String pattern) {
int n = text.length();
int m = pattern.length();
// Step 1: Compute the LPS Array
int[] lps = computeLPS(pattern);
int i = 0; // Pointer for text
int j = 0; // Pointer for pattern
while (i < n) {
if (text.charAt(i) == pattern.charAt(j)) {
i++;
j++;
}
if (j == m) { // Pattern match found
System.out.println("Pattern found at index: " + (i - j));
j = lps[j - 1]; // Continue searching for further matches
} else if (i < n && text.charAt(i) != pattern.charAt(j)) {
if (j != 0) {
j = lps[j - 1]; // Use LPS to skip comparisons
} else {
i++;
}
}
}
}
Step 3: Run the Algorithm
public static void main(String[] args) {
String text = "aaabcabcdabcabcabcd";
String pattern = "abcabcd";
KMPSearch(text, pattern);
}
Example Walkthrough
Input:
Text:
"aaabcabcdabcabcabcd"
Pattern:
"abcabcd"
Execution:
Compute the LPS table for the pattern:
- LPS:
[0, 0, 0, 1, 2, 3, 0]
- LPS:
Match the pattern in the text:
Start matching from index 3:
"abcabcd"
matches.Skip redundant comparisons and find another match at index 12.
Output:
Pattern found at index: 3
Pattern found at index: 12
Complexity Analysis
Preprocessing Phase:
Time Complexity: O(m)O(m), where mm is the length of the pattern.
Space Complexity: O(m)O(m), for the LPS table.
Matching Phase:
- Time Complexity: O(n)O(n), where nn is the length of the text.
Overall Time Complexity: O(n+m)O(n+m)
When to Use KMP
Text Processing: Searching for a substring in documents or logs.
Bioinformatics: Pattern searching in DNA sequences.
Search Engines: Finding keywords in large texts.
Code Editors: Implementing "Find" or "Search" functionality.
Why KMP is Better
The naive algorithm can have a worst-case complexity of O(n×m)O(n×m), while KMP guarantees linear performance relative to the size of the input. It is especially effective when repeated patterns or large texts are involved.
Conclusion
The KMP algorithm is a cornerstone of efficient string matching. By leveraging the LPS table, it eliminates unnecessary comparisons and optimizes performance. Whether you’re building search features in applications or analyzing large datasets, KMP is an indispensable tool.
Try implementing this in your projects and see its magic in action! 🚀