Reputation: 477
This will probably sound dumb but I am trying to make an efficient program (time/memory wise). Studying hash tables, I have seen that a hash table is basically an array of linked lists, where the lists start filling up once the spots in the table are taken. This starts demanding memory and time, because it needs mallocs to store the data and time to search for an element, unlike an array; and all of this is because arrays are not dynamic and have a fixed limit.
So I was just wondering: why can't I make a 20-billion-element array so I can access everything in O(1) using indexes, with no need for mallocs? Like, one huge array in main and that's all.
I need to save a text as a bunch of lines, so I know where each line is (line 1 will be the first, duh), and it did seem unnecessary to me to use a hash table. But the problem is I don't know how many lines it will have, so if I make an array of 50 maybe it's not enough. I am wondering if it is better to use a list/hash table/some other structure or just an array of char arrays.
Upvotes: 1
Views: 1780
Reputation: 123458
This will probably sound dumb but I am trying to make an efficient program (time/memory wise)
Efficient program to do what? You never really say what you're trying to do.
studying hash tables I have seen that it's basically an array of linked lists
That's a common implementation, but it doesn't speak to why you'd use a hash table in the first place.
You use a hash table when you're searching for a record based on a non-numeric key (i.e., a string). You feed that key into a hashing function that spits out an integer value, and you use that value to index into the table. So if f("foo") spits out 3, that's the table index you use to store your data with the key "foo".
No practical hashing function is perfect, and different strings can result in the same hash value, known as a collision. Using a linked list is one way to resolve collisions; other ways are to compute a secondary index into the table or to simply add 1 to the returned index until a free slot is found.
Computing a hash from the key is fast relative to a linear or binary search, giving a time complexity of O(1) compared to a linear search time of O(n) and a binary search time of O(log2 n). The tradeoff is that your table isn't ordered in any way - a linear traversal of it will appear randomly ordered.
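For illustration, here is a minimal sketch of such a hash function - a simple djb2-style string hash reduced modulo the table size. The names hash and TABLE_SIZE are just placeholders chosen for this sketch, not part of any standard API:

    #include <stdio.h>
    #include <stddef.h>

    #define TABLE_SIZE 101          /* number of slots; a prime helps spread the indexes */

    /* Simple djb2-style string hash, reduced to a table index. */
    static size_t hash( const char *key )
    {
        size_t h = 5381;
        for ( ; *key; key++ )
            h = h * 33 + (unsigned char)*key;
        return h % TABLE_SIZE;
    }

    int main( void )
    {
        /* Whatever index hash("foo") returns is the slot where the record
           keyed by "foo" would be stored and later looked up. */
        printf( "index for \"foo\": %zu\n", hash( "foo" ) );
        printf( "index for \"bar\": %zu\n", hash( "bar" ) );
        return 0;
    }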
Edit
From the comment:
I need to save a text as a bunch of lines, so I know where each line is (line 1 will be the first, duh), and it did seem unnecessary to me to use a hash table. But the problem is I don't know how many lines it will have, so if I make an array of 50 maybe it's not enough. I am wondering if it is better to use a list/hash table/some other structure or just an array of char arrays. (added in post too)
If you just need to store a sequence of strings, you can dynamically allocate an array and then extend that array as necessary. Assuming all the lines are of a known fixed length, you can do something like this (reads the file into memory, dumps the contents to standard output):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LENGTH 80

/* Read lines from in into a dynamically allocated array of fixed-length rows,
   doubling the array as needed.  Returns the number of lines read; *arr
   receives the address of the allocated array. */
size_t read_arr( FILE *in, char (**arr)[MAX_LENGTH+1] )
{
    size_t size = 1;
    size_t read = 0;

    *arr = malloc( sizeof **arr * size );
    if ( !*arr )
    {
        return 0;
    }

    char buf[MAX_LENGTH+1];
    while( fgets( buf, sizeof buf, in ) )
    {
        if ( read == size )
        {
            /* Array is full - double its size. */
            char (*tmp)[MAX_LENGTH+1] = realloc( *arr, sizeof **arr * (size * 2) );
            if ( tmp )
            {
                *arr = tmp;
                size *= 2;
            }
            else
            {
                fprintf( stderr, "Unable to extend array past %zu entries, returning what we have so far.\n", read );
                return read;
            }
        }
        strcpy( (*arr)[read++], buf );
    }
    return read;
}

int main( int argc, char **argv )
{
    if ( argc < 2 )
    {
        fprintf( stderr, "USAGE: %s <file name>\n", argv[0] );
        return EXIT_FAILURE;
    }

    FILE *in = fopen( argv[1], "r" );
    if ( !in )
    {
        fprintf( stderr, "File %s not found\n", argv[1] );
        return EXIT_FAILURE;
    }

    char (*arr)[MAX_LENGTH+1];
    size_t n = read_arr( in, &arr );
    fclose( in );

    for ( size_t i = 0; i < n; i++ )
        printf( "%s", arr[i] );

    free( arr );
    return EXIT_SUCCESS;
}
realloc is a relatively expensive operation, so you don't want to do it for every line in the file. Doubling the array each time minimizes the number of calls, although the tradeoff is the possibility of internal fragmentation (needing 256 rows to store 129 lines, for example). But on average that shouldn't be a problem.
Could you tell me what char (**arr)[MAX_LENGTH+1] is? I've never seen this structure; is it a 2D array?
Yeah, I guess I should explain that.
    T (*a)[N];

declares a as a pointer to an N-element array of T. An expression of type T [M][N] will "decay" to type T (*)[N] (not T **).

I want to dynamically allocate enough space to store M objects of type T [N]. So we start with the common idiom

    P *p = malloc( sizeof *p * M );

sizeof *p is equivalent to sizeof (P), so we're allocating enough space to store M objects of type P. Now we replace the type P with the array type T [N], which gives us

    T (*p)[N] = malloc( sizeof *p * M );

In this case, sizeof *p is equivalent to sizeof (T [N]), so we're allocating enough space to store M N-element arrays of T.

Since a[i] is defined as *(a + i), the following is true:

    (*p)[i] == (*(p + 0))[i] == (p[0])[i] == p[0][i]

So we can index into p like any other 2D array.
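For a concrete, made-up example with T = char and N = 8, allocating three rows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main( void )
    {
        size_t M = 3;                            /* number of rows */
        char (*p)[8] = malloc( sizeof *p * M );  /* M arrays of char [8] */
        if ( !p )
            return EXIT_FAILURE;

        strcpy( p[0], "foo" );                   /* p[i] is a char [8], p[i][j] is a char */
        strcpy( p[1], "bar" );
        strcpy( p[2], "baz" );

        for ( size_t i = 0; i < M; i++ )
            printf( "%s\n", p[i] );

        free( p );
        return EXIT_SUCCESS;
    }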
So in the main function above, I declare arr as a pointer to a MAX_LENGTH+1-element array of char. Since I want read_arr to update the value stored in arr itself (the address of the allocated memory), I need to pass a pointer to arr. Remember, if you want a function to update one of its parameters, you must pass a pointer to that parameter, even if that parameter is already a pointer type. If the type of arr is char (*)[MAX_LENGTH+1], then the type of &arr is char (**)[MAX_LENGTH+1], or "pointer to pointer to MAX_LENGTH+1-element array of char".
And again, this assumes that all the lines in the file are close to the same length, and that they're all less than some known maximum length. If you have a file where the lines have wildly different lengths, or 99% have length 20 and one or two have length 200, then you'll want to do something else.
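As a smaller, hypothetical illustration of that pass-a-pointer rule on its own (make_buffer is a made-up name, not part of the program above):

    #include <stdlib.h>

    /* To let a function change the caller's pointer, pass the pointer's address. */
    static void make_buffer( char **out, size_t n )
    {
        *out = malloc( n );       /* writes through the pointer to pointer */
    }

    int main( void )
    {
        char *buf = NULL;
        make_buffer( &buf, 100 ); /* pass &buf so make_buffer can update buf itself */
        free( buf );              /* buf now points at the allocated block (or NULL) */
        return 0;
    }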
Upvotes: 2
Reputation: 58848
It's actually very simple.
The reason you can't use a 20-billion-element array is that it would take 80 gigabytes of RAM with 4-byte elements, or 160 gigabytes if the elements are 64-bit.
Also, you can't index them with strings. myArray["hello"] = "world"; is never ever ever ever going to work.
Upvotes: 1
Reputation:
You can, if you have that much memory. But from a statistical point of view a hash table will be just as good: it has O(1) lookup on average. It may be hard to see how and why that works, so my tip would be to try implementing your own just to learn. Also, malloc calls shouldn't be slow on a good OS.
Upvotes: 1
Reputation: 25266
A hash table is a data structure that allows for fast retrieval of a data item given a key.
A hash table is indexed using a hash of the key. It maps the key onto an integer that is used to index the table. The hash function is generally very fast.
The hash table is generally much smaller than the number of possible keys. Because of this, the hash function can generate the same index for more than one distinct key. This is called a collision. To handle collisions, a hash table entry generally has a linked list (but a balanced binary tree could also be possible). The list stores all the collisions of the hash table entry.
So given a key, the hash function determines the index into the table and then the list of that entry is searched for the actual key and then the data associated with that key is retrieved. As you see, this is much faster than searching a linked list, and uses a lot less memory than an array that has an entry for every possible key.
There is some overhead in maintaining the table and list, but the main gain is the fast data retrieval.
Designing hash functions is a science in itself.
Note: a hash table thus consists of two data structures: the table itself, with its size and hash function, and the hash table entry, which could be a list, a sorted list, a tree, or anything else.
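As a rough sketch of those two pieces - the table with its hash function, plus linked-list entries for collisions - here is a minimal, hypothetical chained hash table (fixed size, no resizing or deletion; put, get, NBUCKETS and the entry layout are names invented for this sketch):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 101

    /* One entry in a bucket's collision list: key, data, next entry. */
    struct entry {
        char *key;
        int value;
        struct entry *next;
    };

    static struct entry *table[NBUCKETS];    /* the hash table itself */

    /* Simple string hash mapped onto a bucket index. */
    static size_t hash( const char *key )
    {
        size_t h = 5381;
        for ( ; *key; key++ )
            h = h * 33 + (unsigned char)*key;
        return h % NBUCKETS;
    }

    /* Insert key/value, overwriting the value if the key already exists. */
    static void put( const char *key, int value )
    {
        size_t i = hash( key );
        for ( struct entry *e = table[i]; e; e = e->next )
        {
            if ( strcmp( e->key, key ) == 0 )
            {
                e->value = value;
                return;
            }
        }
        struct entry *e = malloc( sizeof *e );
        if ( !e )
            return;
        e->key = malloc( strlen( key ) + 1 );
        if ( !e->key )
        {
            free( e );
            return;
        }
        strcpy( e->key, key );
        e->value = value;
        e->next = table[i];                   /* push onto the bucket's collision list */
        table[i] = e;
    }

    /* Return a pointer to the value stored under key, or NULL if absent. */
    static int *get( const char *key )
    {
        for ( struct entry *e = table[hash( key )]; e; e = e->next )
            if ( strcmp( e->key, key ) == 0 )
                return &e->value;
        return NULL;
    }

    int main( void )
    {
        put( "foo", 1 );
        put( "bar", 2 );
        int *v = get( "foo" );
        printf( "foo -> %d\n", v ? *v : -1 );
        return 0;
    }

A lookup costs one hash computation plus a walk down a usually short collision list, which is where the average O(1) comes from.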
Upvotes: 1
Reputation: 477
Using an array is useful if you already know the index of every element you need to search for, since that keeps every operation fast.
As for memory, because arrays have a fixed size it all comes down to whether or not you know in advance how much data you have to store: if "just to be sure" you make an array with 20 billion slots and then end up using only the first 100, that is a poor choice compared to dynamically allocated memory, which is extended only when more space is actually needed.
Upvotes: 0