Tutorial
String Searching
Medium
String Searching
Overview

String Searching by KMP algorithm (Knuth Morris Pratt algorithm)

Motivation Problem: Given $$2$$ strings $$P$$(pattern) and $$T$$(text), find the number of occurrences of $$P$$ in $$T$$.

Basic / Brute Force solution:

One obvious and easy to code solution that comes to mind is this: For each index of $$T$$, take it as a starting point and find if $$T_{i,i+1,...,i+|P|-1}$$ is equal to $$P$$.

for i = 0 to length(T)-length(P):

    Found = true

    for j = i to i + length(P) - 1:

        if P[j-i] not equal to T[j]:

            Found = False

    if Found = True 

        answer = answer + 1

This brute force takes $$O(|P| \cdot |T|)$$ time in the worst case, which is obviously too slow for large strings.

Knuth Morris Pratt Algorithm:

Suppose for each index $$i$$ of some string $$Z$$, the longest suffix in $$Z_{0,1,...,i}$$ that is also a prefix of $$Z_{0,1,...,i}$$, be known. Formally, a length $$F_i$$ is known such that $$Z_{0,1,...,F_i-1}$$ = $$Z_{i-F_i+1,...,i}$$. Let these lengths be stored in array $$F$$. The suffix needs to be proper(whole string is not a proper suffix).

Then the solution to the motivation problem can be found as follows:

Define a string $$V = P + '#' + T$$, where $$'#'$$ is a delimiter that is not present in either of $$P$$ or $$T$$. Now, if the above information is known, all occurrences of $$P$$ in $$T$$ can be found as follows: If at some index $$i$$, $$F_i = |P|$$, then there is an occurrence of Pattern $$P$$ at position $$i-|P|+1$$. All such indices from $$|P|+1$$ [0 based indexing, the index just after '#'], need to be checked.

The main part of KMP algorithm calculates the array $$F$$, which is also called the prefix function. If calculation of $$F$$ or the prefix function can be done efficiently, then we have an efficient solution to the motivation problem. KMP algorithm finds the prefix function in $$O(length of String)$$ time.

To find the prefix function, best possible use of previous values of array $$F$$ is made, so that calculations aren't done again and again. Suppose all $$F_i$$ have been calculated, and now $$F_{i+1}$$ is to be calculated. It is to be noted that, value of $$F_{i+1}$$ can be at most 1 greater than $$F_i$$. Here is a proof by contradiction:

Suppose $$F_{i+1} > F_i + 1$$. Now, if the $$(i+1)^{th}$$ character is removed, we obtain a suffix ending at index $$i$$ that is of length $$F_{i+1} - 1$$, which is greater than $$F_i$$. This is a contradiction, hence proved.

Observe that if $$Z_{i+1} = Z_{F_i}$$, then the value of $$F_{i+1} = F_i + 1$$. If not, a smaller suffix ending at index $$i$$ is to be found, that is also a prefix of $$Z_{0,1...,i}$$. Let the length of such a suffix be $$j$$, then if $$Z_{i+1} = Z_{j}$$ then $$F_{i+1} = j + 1$$. If again the equality doesn't hold true, smaller and smaller suffixes that end at index $$i$$, which are also prefixes of $$Z_{0,1...,i}$$ need to be found.

The only thing remaining is, how to find the length of next smaller suffix ending at index $$i$$, that is also a prefix? This is also pretty simple. Observe that due to the property of $$F$$, the segment $$Z_{0,1,...,F_i-1}$$ is equal to the segment $$Z_{i-F_i+1,...,i}$$. So to find the next smaller suffix ending at index $$i$$, the longest suffix ending at $$F_i - 1$$ can be found which is $$F_{F_i-1}$$, and this suffix will be the next smaller suffix ending at index $$i$$. If this suffix also doesn't satisfy our criteria, then smaller suffixes can be found with the same process, here it will be $$F_{F_{F_i-1} - 1}$$. Note that, if at some point the length becomes $$0$$, the process is stopped.

This completes $$KMP$$ algorithm. Below is the code:

vector<int> prefix_function (string Z) {

    int n = (int) Z.length();

    vector<int> F (n);

     F[0]=0;

    for (int i=1; i<n; ++i) {

        int j = F[i-1];

        while (j > 0 && Z[i] != Z[j])

            j = F[j-1];

        if (Z[i] == Z[j])  ++j;

        F[i] = j;

    }

    return F;

}

Finding the $$F$$ array for "ABCABC"

Initially, $$F_0 = 0$$.

Index $$1 \rightarrow F_{0} = 0$$, $$j$$ does not go into while loop and $$Z_j \neq Z_i$$, therefore value of $$F_i = 0$$.

Index $$2 \rightarrow F_{1} = 0$$, $$j$$ does not go into while loop and $$Z_j \neq Z_i$$, therefore value of $$F_i = 0$$.

Index $$3 \rightarrow F_{2} = 0$$, $$j$$ does not go into while loop and $$Z_j = Z_i$$, therefore value of $$F_i = 1$$.

Index $$4 \rightarrow F_{3} = 1$$, $$j$$ satisfies while loop condition but $$Z_j = Z_i$$, hence does not go into while loop, therefore value of $$F_i = 2$$.

Index $$5 \rightarrow F_{4} = 2$$, $$j$$ satisfies while loop condition but $$Z_j = Z_i$$, therefore value of $$F_i = 3$$.

Problem
Apply KMP
3.6 (12 votes)
Approved, very Easy, string
78% Success 12574 Attempts 10 Points 5s Time Limit 256MB Memory 1024 KB Max Code

Given 2 strings, P and T, find the number of occurrences of P in T.

Input:

First line contains string P, and second line contains the string T.

Output:

Print a single integer, the number of occurrences of P in T.

Constraints:

\(1 \le |P| \le |T| \le 10^{5}\)

Please login to use the editor

You need to be logged in to access the code editor

Loading...

Please wait while we load the editor

Loading...
Results
Custom Input
Run your code to see the output