Aditya Nambiar
Aditya Nambiar

Reputation: 886

Kleene Closure for infinite subset

Let L = {an | n >= 0}, where enter image description here and enter image description here for all n >= 1.

Thus L consists of sequences of a of all lengths, including a sequence of length 0. Let L2 be any infinite subset of L. I need to show there always exists a DFA to recognize (L2)*.

If L2 is a finite subset it is very obvious as L2 would be a DFA and hence by kleene closure L2* would be recognized by a DFA. But I am unable to get it for infinite subset as L2 may not be expressed as DFA for eg length of strings is prime.

Upvotes: 0

Views: 906

Answers (1)

nhahtdh
nhahtdh

Reputation: 56809

Approach

While there exists a DFA to describe a set L of all strings an, n >= 0, there is no guarantee that a DFA exists for all subsets of L. The subset of L which contains all strings whose length is prime, as you have mentioned, is one example where a DFA the describes the language does not exist.

The correct approach would be to directly prove that (L')* is a regular language for any subset L' of L.

Definition

Let us define GCD(K) = GCDw ∈ K, |w| > 0 (|w|), where K is any non-empty subset of L. We can now refer to the greatest common divisor of all the lengths of all non-empty words in a language K as GCD(K). This definition applies for both finite and infinite subset of L.

Similarly, we can define LCM(K) = LCMw ∈ K, |w| > 0 (|w|), where K is any non-empty and finite subset of L.

Proof

We will try to prove that when GCD(L') = 1, there exists a number M such that all string an, n >= M belongs to the language (L')*. This leads to (L')* being a regular language, since we can construct a regular expression of the form:

All strings of length less than M and belongs to (L')*
OR
All strings of length more than or equal to M

The regular expression above has a corresponding DFA which has M + 1 states.

When GCD(L') > 1, we can reduce the problem to the case of GCD = 1 by "dividing" all words in the subset L' by GCD(L').


If GCD(L') = 1 (set-wise coprime), there exists a finite subset S of L' where GCD of the length of all strings in S is also 1.

We can prove the claim above by construction.

  • Pick any element w1 from L', |w1| > 0 and construct set S1 = {w1}
  • If GCD(Sn) = 1, Sn is the set we want to find.
  • If GCD(Sn) > 1, pick an element wn+1 from L' and construct set Sn+1 = {wn+1} ∪ Sn, so that
    GCD(Sn+1) < GCD(Sn)

If GCD(Sn) > 1, there always exists an element from set L' that decreases the GCD when we add it to the set; otherwise, the GCD of the set L' cannot be 1. And since the length of the first element w1 has finite number of factors, the size of the final set S is finite.

Back to the problem, for any subset L' of L, we can find a finite subset S of L', which satisfies GCD(L') = GCD(S). From the set S, we can construct a generalized linear Diophantine equation with |S| unknowns ai:

a1|w1| + a2|w2| + ... + a|S||w|S|| = c where c is a non-negative integer

Since GCD(S) = 1, the equation above is always solvable, by recursively applying the solution to the simplest form of linear Diophantine equation ax + by = c.

Solve the generalized Diophantine equations above for c = 0 to (LCM(S) - 1). The solutions (a1, a2, ..., a|S|) can contain negative numbers. However, we can keep adding multiples of LCM(S) to both sides of the equations until all the solutions contain only non-negative numbers.

Let k be the smallest multiple of LCM(S) so that all Diophantine equations for c = k * LCM(S) + q, q = 0 to (LCM(S) - 1) has non-negative solution. Then we can define M as k * LCM(S), since any strings whose length larger than M can be decomposed as concatenation of words in S (thus in L').

Example calculation based on the proof

Suppose L' is set of all strings in L whose length is prime.

Let us construct set S = {a2, a5}. S can be {a2, a19} or {a5, a23}, doesn't really matter. The final value of M might be different, but it doesn't affect the fact that (L')* is regular language.

We need to solve 10 equations (separately):

2a1 + 5a2 = 0 => (a1, a2) = (0, 0)
2a1 + 5a2 = 1 => (a1, a2) = (3, -1)
2a1 + 5a2 = 2 => (a1, a2) = (1, 0)
2a1 + 5a2 = 3 => (a1, a2) = (-1, 1)
2a1 + 5a2 = 4 => (a1, a2) = (2, 0)
2a1 + 5a2 = 5 => (a1, a2) = (0, 1)
2a1 + 5a2 = 6 => (a1, a2) = (3, 0)
2a1 + 5a2 = 7 => (a1, a2) = (1, 1)
2a1 + 5a2 = 8 => (a1, a2) = (4, 0)
2a1 + 5a2 = 9 => (a1, a2) = (2, 1)

Add one LCM(2,5) = 10. Note that we can modify the solution directly without solving again, due to the property of LCM:

2a1 + (5a2 + 10) = 1 + 10 => (a1, a2) = (3, 1)
(2a1 + 10) + 5a2 = 3 + 10 => (a1, a2) = (4, 1)

Since all the solutions are non-negative, and we only add one LCM(2,5), M = 10.

The regular expression for (L')* can be constructed:

a2+a4+a5+a6+a7+a8+a9+a10a*

The regular expression is not very compact, but it is not our concern here. For the sake of the proof, we only need to know there exists a number M so that an belongs to (L')* for all n >= M, which implies that there are finite number of states and a DFA can be constructed.

Upvotes: 1

Related Questions