Reputation: 730
I have written these two algorithms to check a string for duplicate characters (e.g. ABBC, AAAC). The first uses a HashSet, whilst the second relies purely on iteration.
Algorithm 1
String s = "abcdefghijklmnopqrstuvwxxyz";

public boolean isUnique(String s) {
    Set<Character> charSet = new HashSet<Character>();
    for (int i = 0; i < s.length(); i++) {
        if (charSet.contains(s.charAt(i))) {
            return false;
        }
        charSet.add(s.charAt(i));
    }
    return true;
}
Algorithm 2
String s = "abcdefghijklmnopqrstuvwxxyz";

public boolean isUnique2(String s) {
    for (int i = 0; i < s.length() - 1; i++) {
        for (int j = i + 1; j < s.length(); j++) {
            if (s.charAt(i) == s.charAt(j)) {
                return false;
            }
        }
    }
    return true;
}
My thoughts are that the first algorithm is O(N) and the second is O(N^2). When I run execution-time tests on my (possibly unreliable) laptop, the first algorithm averages 2020 ns, whilst the second averages 995 ns. This goes against my calculation of the algorithms' complexity; could anybody advise me?
Upvotes: 0
Views: 136
Reputation: 2257
There is a problem with your test data. For example, if you limit yourself to English-language characters (a-z), you are guaranteed to have a duplicate once the string length exceeds 26. In the specific example you provided, the string "abcdefghijklmnopqrstuvwxxyz" is sorted and the duplicate element x sits towards the end. As such the iterated lookup is faster here, because there is overhead in building the HashSet as you continue to parse the String.
A better test would be to use randomly generated integer sequences of large sizes and with a large max value, e.g. Long.MAX_VALUE.
Below is a test that disproves your assertion that array search is faster. Run it a few times and see for yourself, or take averages from 1000 runs, etc:
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class FindDuplicatesTest {

    public static final String s = generateRandomString(100000);

    private static String generateRandomString(int numChars) {
        Random random = new Random();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < numChars; i++) {
            int codePoint = random.nextInt(65536);
            sb.append(Character.toChars(codePoint));
        }
        return sb.toString();
    }

    public boolean isUnique(String s) {
        Set<Character> charSet = new HashSet<Character>();
        for (int i = 0; i < s.length(); i++) {
            if (charSet.contains(s.charAt(i))) {
                return false;
            }
            charSet.add(s.charAt(i));
        }
        return true;
    }

    public boolean isUnique2(String s) {
        for (int i = 0; i < s.length() - 1; i++) {
            for (int j = i + 1; j < s.length(); j++) {
                if (s.charAt(i) == s.charAt(j)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FindDuplicatesTest app = new FindDuplicatesTest();
        long start = System.nanoTime();
        boolean result = app.isUnique(s);
        long stop = System.nanoTime();
        System.out.println(result);
        System.out.println("HashSet Search Time: " + (stop - start));

        start = System.nanoTime();
        result = app.isUnique2(s);
        stop = System.nanoTime();
        System.out.println(result);
        System.out.println("Array Search Time: " + (stop - start));
    }
}
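As one way of taking those averages: below is a hedged sketch that reruns both approaches many times on the same random input and reports the mean per-call time. The class and method names here are my own, mirroring the ones above; the run count and string size are illustrative.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch: average each method's running time over many runs on one
// random input, so a single noisy measurement doesn't dominate.
public class AverageBenchmark {

    static boolean isUnique(String s) {
        Set<Character> seen = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            if (!seen.add(s.charAt(i))) {   // add() returns false on a duplicate
                return false;
            }
        }
        return true;
    }

    static boolean isUnique2(String s) {
        for (int i = 0; i < s.length() - 1; i++) {
            for (int j = i + 1; j < s.length(); j++) {
                if (s.charAt(i) == s.charAt(j)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Build one random 100k-char input, reused across all runs.
        Random random = new Random();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append((char) random.nextInt(65536));
        }
        String s = sb.toString();

        int runs = 1000;
        long hashTotal = 0, loopTotal = 0;
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            isUnique(s);
            hashTotal += System.nanoTime() - t0;

            t0 = System.nanoTime();
            isUnique2(s);
            loopTotal += System.nanoTime() - t0;
        }
        System.out.println("HashSet mean ns:     " + hashTotal / runs);
        System.out.println("Nested-loop mean ns: " + loopTotal / runs);
    }
}
```

Even averaged like this, results stay sensitive to JIT warm-up; a harness such as JMH is the more rigorous option.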
Upvotes: 0
Reputation: 9450
Micro-benchmarking like this can give very misleading information about algorithmic complexity.
It's easy to "port" your algorithms to check for duplicates in, say, an array of Integers.
Then I recommend testing performance on, say, an array of 10^7 elements, and you will definitely see the difference.
This way you'd be able to confirm your initially correct estimate: O(N) for the HashSet version vs O(N^2) for the nested-loop version.
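The suggested port might look like the sketch below. The names are my own; I use long values (rather than Integer) to get a large value range, and the array size in main is illustrative, kept moderate so it runs in reasonable memory.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch of the suggested port: duplicate detection over a large random
// array, so growth in N (not character-set saturation) drives the timing.
public class IntDuplicates {

    static boolean isUniqueHash(long[] a) {
        Set<Long> seen = new HashSet<>();
        for (long v : a) {
            if (!seen.add(v)) {     // add() returns false on a duplicate
                return false;
            }
        }
        return true;
    }

    static boolean isUniqueLoops(long[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            for (int j = i + 1; j < a.length; j++) {
                if (a[i] == a[j]) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Random random = new Random();
        long[] a = new long[1_000_000];
        for (int i = 0; i < a.length; i++) {
            a[i] = random.nextLong();   // spread across the full long range
        }
        long t0 = System.nanoTime();
        boolean unique = isUniqueHash(a);
        System.out.println("hash: " + unique + " in " + (System.nanoTime() - t0) + " ns");
        // Deliberately not timing isUniqueLoops at this size: with duplicates
        // this rare, the O(N^2) scan would run for hours. Time it on a much
        // smaller slice and watch the gap widen as N grows.
    }
}
```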
Upvotes: 1
Reputation: 8147
When using O() notation you ignore constants, which means that 10^10 * n is still O(n). So while O(n^2) > O(n) holds asymptotically, the quadratic algorithm is not necessarily slower for smaller values of n. In your case, imagine that resizing the array behind the HashSet could be more time-consuming than iterating the input.
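One way to shrink that constant is to presize the set so it never resizes. This is a sketch under my own assumptions (class and method names are mine); the capacity formula accounts for HashMap's default load factor of 0.75.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: a presized HashSet avoids the rehash/resize work that
// contributes to the constant hidden inside O(N).
public class PresizedSet {

    static boolean isUnique(String s) {
        // Capacity chosen so that s.length() elements never trigger a
        // resize under the default 0.75 load factor.
        Set<Character> seen = new HashSet<>(s.length() * 4 / 3 + 1);
        for (int i = 0; i < s.length(); i++) {
            if (!seen.add(s.charAt(i))) {   // add() returns false on a duplicate
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isUnique("abcdef"));   // true
        System.out.println(isUnique("abccba"));   // false ('c' repeats)
    }
}
```

The asymptotic cost is unchanged; only the constant factor shrinks, which is exactly the part O() notation ignores.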
Upvotes: 2
Reputation: 133
Assuming that the charAt method runs in O(1) time, the first algorithm is O(N) and the second is O(N^2). A linear-time algorithm is not supposed to be faster than a quadratic one for all inputs; it will only be faster after a certain N (which could possibly be in the millions).
For example:
void funcA(int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < 10000; j++) {
            int k = i + j;
        }
    }
}

void funcB(int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int k = i + j;
        }
    }
}
Even though funcA is linear and funcB is quadratic, it is easy to see that funcB will be faster than funcA for n < 10000. In your case the HashSet requires time to compute hashes, and so may be slower for inputs below a certain size.
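Counting inner-loop iterations instead of timing them makes that crossover exact. A small sketch using the funcA/funcB shapes above (the class and method names are mine):

```java
// Sketch: count the inner-loop iterations of the funcA/funcB shapes
// rather than timing them, so the crossover point is deterministic.
public class Crossover {

    static long opsA(int n) { return (long) n * 10000; }  // linear: n * constant
    static long opsB(int n) { return (long) n * n; }      // quadratic: n * n

    public static void main(String[] args) {
        for (int n : new int[] {100, 10000, 1000000}) {
            System.out.println("n=" + n
                    + " funcA=" + opsA(n)
                    + " funcB=" + opsB(n)
                    + " quadraticFaster=" + (opsB(n) < opsA(n)));
        }
    }
}
```

Below n = 10000 the quadratic shape does strictly less work; above it, the linear shape wins, and the gap grows without bound.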
Upvotes: 1