Reputation: 191
I'm trying to create some code to fish out records from a list of about 200k to 1million records. Obviously, I would want this process to be as fast as possible. The basic idea is as follows, the records in the large list are a combination of numbers which are to be kept together. For Example:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,400076,400097,800076,800097
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,200032,200078,500032,500078
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,300043,300083,600043,600083
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,600026,600077,900026,900077
0,0,0,0,0,0,0,0,0,0,0,0,0,0,100008,100028,400028,400056,600008,600056
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,400042,400098,500042,500098
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,15,86,500015,500086
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,400013,400076,800013,800076
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,700024,700083,900024,900083
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100003,100047,800003,800047
The maximum length of the record is 20 which is why the additional zeroes. Let's not worry about these for a moment. So, I want to "fish" out some records such that no repetitions are observed. If one repetition is there, I can discard that record and no longer look at it further. Thus, I must compile a list which looks like this:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,400076,400097,800076,800097
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,200032,200078,500032,500078
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,300043,300083,600043,600083
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,600026,600077,900026,900077
0,0,0,0,0,0,0,0,0,0,0,0,0,0,100008,100028,400028,400056,600008,600056
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,400042,400098,500042,500098
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,15,86,500015,500086
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,700024,700083,900024,900083
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100003,100047,800003,800047
Note how in the above list, record no. 8 is missing because the number 400076 already exists in a previous record.
The code I am using to do this is as follows:
void Make_List(ConfigList *pathgroups, ConfigList *configlist)
{
int i,j,k,l,flag,pg_num=0,len,p_num=0;
for(i = 0;i<configlist->num_total;i++)
{
flag = 0;
for(j = configlist->configsize-1;j>=0;j--)
{
if(configlist->pathid[i][j])
{
for(k = 0;k<pg_num;k++)
{
for(l = pathgroups->configsize-1;l>=0;l--)
{
if(pathgroups->pathid[k][l])
{
if(configlist->pathid[i][j]==pathgroups->pathid[k][l])
{
flag++;
break;
}
}
else
{
break;
}
}
if(flag)
{
break;
}
}
}
else
{
break;
}
if(flag)
{
break;
}
}
if(!flag)
{
len = 0;
for(j = configlist->configsize-1;j>=0;j--)
{
pathgroups->pathid[pg_num][j]=configlist->pathid[i][j];
if(configlist->pathid[i][j])
{
len++;
}
}
pg_num++;
p_num+=len;
if(p_num>=totpaths)
{
break;
}
}
}
Print_ConfigList(stderr,pathgroups);
}
The structure ConfigList basically stores the 2D array along with other things used in different parts of the program.
num_total
tells us the number of rows in the array whereas configsize
tells us the number of columns in the array.
totpaths
is a breakpoint which terminates the loop early in case assignment is completely finished.
Upvotes: 1
Views: 86
Reputation: 27702
Checking if each element is repeated for each new element analyzed has a computational cost of O(N^2)
which, given your large input set, is far too much.
Basically, what you need is a fast access data-structure where you can keep a count of how many times your record has appeared or at least a boolean flag.
The easiest way to do this is to have an array where the position represent each possible value and the array value the count of times the position value has appeared (or its boolean value of existence). However, if your data range is too large you can do this because the memory used to store the array is proportional to the range size.
The alternative to avoid that is to use Hash tables or sets.
As you has established in your comments above, your integer range is [0,99999999]
so if you wanted to use a vector to keep track of the presence or not of each single value you would need approximately 96 MB
to store it in memory.
This is an example using byte arrays:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_IN_RANGE 99999999
int main()
{
char * isInInput = (char*)malloc(MAX_IN_RANGE+1);
memset(isInInput,0,MAX_IN_RANGE+1);
size_t i;
int inputExample[] = {1,3,5,2,1,5};
for(i = 0; i < 6; i++)
{
int value = inputExample[i];
printf("%d\n",value);
if(!isInInput[value])
{
printf("Add value %d to your collection\n", value);
isInInput[value] = 1;
}
else
{
printf("%d is repeated\n", value);
}
}
free(isInInput);
}
To use hash tables instead you can rely on libraries such as Judy in order to avoid implementing your own hash table.
Upvotes: 2