This question is based on the code at: http://nlp.stanford.edu/projects/glove/
The code below behaves as I would expect: it echoes user input back from stdin.
stdin: The standard input stream is the default source of data for applications. In most systems, it is usually directed by default to the keyboard.
Type in some text and press Enter, and that text is echoed back to the console. Normal and expected.
// _CRT_SECURE_NO_WARNINGS: silence MSVC warning C4996 for fopen
#pragma warning(disable : 4996)
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int ch;      // Char as int, so that EOF can be told apart from real chars
    FILE *fid;   // File pointer

    // Open the file stream:
    fid = fopen("<Path to simple text file>/text.txt", "r");
    if (fid == NULL)
    {
        printf("Stream Error: File was not opened!\n");
    }
    else
    {
        // Loop through chars until fgetc reports EOF.
        // (Testing feof() before reading would print one bogus char at the end.)
        while ((ch = fgetc(fid)) != EOF)
        {
            putchar(ch);
        }
        // Close the file (only valid because fid is not NULL here):
        fclose(fid);
    }

    // Switch to the standard input stream:
    fid = stdin;
    // Loop through chars:
    while ((ch = fgetc(fid)) != EOF)
    {
        putchar(ch);
    }

    return 0;
}
EXAMPLE source code: http://nlp.stanford.edu/projects/glove/, specifically from line 301 of the cooccur.c code file.
In this code:
fid = fopen(vocab_file,"r");
if(fid == NULL) {fprintf(stderr,"Unable to open vocab file %s.\n",vocab_file); return 1;}
while(fscanf(fid, format, str, &id) != EOF) hashinsert(vocab_hash, str, ++j); // Here id is not used: inserting vocab words into hash table with their frequency rank, j
fclose(fid);
vocab_size = j;
j = 0;
if(verbose > 1) fprintf(stderr, "loaded %lld words.\nBuilding lookup table...", vocab_size);

/* Build auxiliary lookup table used to index into bigram_table */
lookup = (long long *)calloc( vocab_size + 1, sizeof(long long) );
if (lookup == NULL) {
    fprintf(stderr, "Couldn't allocate memory!");
    return 1;
}
lookup[0] = 1;
for(a = 1; a <= vocab_size; a++) {
    if((lookup[a] = max_product / a) < vocab_size) lookup[a] += lookup[a-1];
    else lookup[a] = lookup[a-1] + vocab_size;
}
if(verbose > 1) fprintf(stderr, "table contains %lld elements.\n",lookup[a-1]);

/* Allocate memory for full array which will store all cooccurrence counts for words whose product of frequency ranks is less than max_product */
bigram_table = (real *)calloc( lookup[a-1] , sizeof(real) );
if (bigram_table == NULL) {
    fprintf(stderr, "Couldn't allocate memory!");
    return 1;
}

fid = stdin; // <<<--- STDIN Stream Redirect
sprintf(format,"%%%ds",MAX_STRING_LENGTH);
sprintf(filename,"%s_%04d.bin",file_head, fidcounter);
foverflow = fopen(filename,"w");
if(verbose > 1) fprintf(stderr,"Processing token: 0");

/* For each token in input stream, calculate a weighted cooccurrence sum within window_size */
while (1) {
    if(ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
        qsort(cr, ind, sizeof(CREC), compare_crec);
        write_chunk(cr,ind,foverflow);
        fclose(foverflow);
        fidcounter++;
        sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
        foverflow = fopen(filename,"w");
        ind = 0;
    }
    flag = get_word(str, fid); // <<<--- Reading from the Vocab, not STDIN
    if(feof(fid)) break;
    if(flag == 1) {j = 0; continue;} // Newline, reset line index (j)
    counter++;
    if((counter%100000) == 0) if(verbose > 1) fprintf(stderr,"\033[19G%lld",counter);
    htmp = hashsearch(vocab_hash, str); // <<<--- Using the str that was read in the function: 'get_word'
    if (htmp == NULL) continue; // Skip out-of-vocabulary words
    w2 = htmp->id; // Target word (frequency rank)
    for(k = j - 1; k >= ( (j > window_size) ? j - window_size : 0 ); k--) { // Iterate over all words to the left of target word, but not past beginning of line
        w1 = history[k % window_size]; // Context word (frequency rank)
        if ( w1 < max_product/w2 ) { // Product is small enough to store in a full array
            bigram_table[lookup[w1-1] + w2 - 2] += 1.0/((real)(j-k)); // Weight by inverse of distance between words
            if(symmetric > 0) bigram_table[lookup[w2-1] + w1 - 2] += 1.0/((real)(j-k)); // If symmetric context is used, exchange roles of w2 and w1 (ie look at right context too)
        }
        else { // Product is too big, data is likely to be sparse. Store these entries in a temporary buffer to be sorted, merged (accumulated), and written to file when it gets full.
            cr[ind].word1 = w1;
            cr[ind].word2 = w2;
            cr[ind].val = 1.0/((real)(j-k));
            ind++; // Keep track of how full temporary buffer is
            if(symmetric > 0) { // Symmetric context
                cr[ind].word1 = w2;
                cr[ind].word2 = w1;
                cr[ind].val = 1.0/((real)(j-k));
                ind++;
            }
        }
    }
I would like to know exactly how a word is assigned to str in the call flag = get_word(str, fid); after the stream has been changed to stdin. That str is then used a few lines later: htmp = hashsearch(vocab_hash, str);
This code does many millions of iterations over large corpora; a user does not sit there typing in each word manually.
I would very much appreciate it if someone could explain how this happens after the fid = stdin; stream change.
Upvotes: 1
Views: 1019
Reputation: 2710
Simple to some, but not so for others...
stdin is the default input stream, so code can access it directly through the stdin variable. That is why one sometimes sees, as I have on a few occasions now:
FILE *fid;
fid = stdin;
If the data arriving on stdin does not come from the default source, the stream has been redirected somewhere outside the program. On most machines the default source is the keyboard.
On line 301, fid = fopen(vocab_file,"r"); makes the vocab file the data source of the stream returned by the fopen function. The file is read and processed.
On line 304 the stream is closed: fclose(fid);
On line 329, fid = stdin; assigns stdin as the input stream for fid.
From there on, there is no sign of another stream change in the code, yet str is still being assigned. The words come from one of the text files: get_word fills str from the corpus...
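To make the point concrete, here is a minimal sketch of a word reader that takes any FILE*, so the same loop works whether it is handed a file opened with fopen or stdin. The names read_word and count_words are hypothetical helpers written for this illustration; they are not GloVe's get_word, which also reports newlines through its return value.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper (not GloVe's get_word): reads the next
   whitespace-delimited token from `in` into `word`, keeping at most
   cap - 1 characters. Returns 1 if a token was read, 0 at end of input. */
int read_word(char *word, size_t cap, FILE *in)
{
    int ch;
    size_t n = 0;

    assert(word != NULL && cap > 0 && in != NULL);

    /* Skip leading whitespace. */
    while ((ch = fgetc(in)) != EOF && isspace(ch))
        ;
    if (ch == EOF)
        return 0;

    /* Collect characters until whitespace or EOF; overflow is dropped. */
    do {
        if (n + 1 < cap)
            word[n++] = (char)ch;
    } while ((ch = fgetc(in)) != EOF && !isspace(ch));

    word[n] = '\0';
    return 1;
}

/* Counts the tokens remaining on `in`. The function neither knows nor
   cares whether `in` is a disk file, the keyboard, or a redirected stdin. */
long count_words(FILE *in)
{
    char word[64];
    long total = 0;

    while (read_word(word, sizeof word, in))
        total++;
    return total;
}
```

Calling count_words(stdin) from main and starting the program with < corpus.txt on the command line would count the words of corpus.txt, with no fopen of that file anywhere in the program.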
The command line is the answer, specifically the redirection at the end: < corpus.txt > cooccurrences.bin
./cooccur -verbose 2 -symmetric 0 -window-size 10 -vocab-file vocab.txt -memory 8.0 -overflow-file tempoverflow < corpus.txt > cooccurrences.bin
Quoting cplusplus.com:
Standard input stream
The standard input stream is the default source of data for applications. In most systems, it is usually directed by default to the keyboard.
stdin can be used as an argument for any function that expects an input stream (FILE*) as one of its parameters, like fgets or fscanf.
Although it is commonly assumed that the source of data for stdin is going to be a keyboard, this may not be the case even in regular console systems, since stdin can generally be redirected on most operating systems at the time of invoking the application. For example, many systems, among them DOS/Windows and most UNIX shells, support the following command syntax:
myapplication < example.txt
to use the content of the file example.txt as the primary source of data for myapplication instead of the console keyboard.
It is also possible to redirect stdin to some other source of data from within a program by using the freopen function.
If stdin is known to not refer to an interactive device, the stream is fully buffered. Otherwise, it is library-dependent whether the stream is line buffered or not buffered by default (see setvbuf).
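For comparison, the freopen route mentioned in the quote can be sketched as below. This is not what the GloVe code does (it relies on the shell), and redirect_stdin is a hypothetical wrapper name used only for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical wrapper: rebinds stdin to the file at `path` from inside
   the program. Returns 0 on success, -1 if the file could not be opened.
   Note that after a failed freopen the original stdin is closed as well. */
int redirect_stdin(const char *path)
{
    assert(path != NULL);
    return freopen(path, "r", stdin) == NULL ? -1 : 0;
}
```

After a successful call, every function that reads stdin (scanf, getchar, fgetc(stdin) and so on) takes its data from that file, just as if the program had been started with < path on the command line.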
So there you go: the stdin stream is redirected by the shell, through the < corpus.txt part of the command line (the -overflow-file tempoverflow before it is just an ordinary program argument).
As a result, corpus.txt is the redirected data source of the stdin stream!
Also worth noting: cooccurrences.bin is the redirected destination of the stdout stream, via line 232, fout = stdout;, which is written to on line 270: fwrite(&old, sizeof(CREC), 1, fout);
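The output side can be sketched the same way. Record below is a hypothetical struct modeled on the word1, word2 and val fields that appear in the question's excerpt; the field types are my assumption, and write_record is an illustrative name, not a GloVe function.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type; the field names follow the excerpt,
   the field types are assumed for this sketch. */
typedef struct {
    int word1;
    int word2;
    double val;
} Record;

/* Writes one record in raw binary form to `out`, which may be stdout
   or any other output stream. Returns 1 on success, 0 on failure. */
int write_record(const Record *rec, FILE *out)
{
    assert(rec != NULL && out != NULL);
    return fwrite(rec, sizeof *rec, 1, out) == 1;
}
```

With FILE *fout = stdout; and the program started with > cooccurrences.bin, each write_record(&rec, fout) call lands in that file. One caveat if you build with MSVC: stdout is a text-mode stream by default on Windows, so binary output may need _setmode(_fileno(stdout), _O_BINARY) first to avoid newline translation.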
For more information, see: "Standard Input and Output Redirection"
NOTE: If you want to run this code, remember to build the console app as 64-bit; otherwise it won't be able to allocate the memory!