Reputation: 161
I'm trying to create an inverted index of words and their placements in a given corpus of documents. An example of the data structure I'm aiming for is something like:
+----------+------------------------------------------------------+
| Word     | Location                                             |
+----------+------------------------------------------------------+
| 'word 1' | 'doc1' 'title', 'doc4' 'text', 'doc7' 'title' 'text' |
+----------+------------------------------------------------------+
Where 'title' and 'text' are the possible locations. The above table means that 'word 1' can be found in the title of doc1, the text of doc4, and both the title and the text of doc7.
My code to parse and generate the data is:
while (my $line = <$fh>) {
    # determine doc no and location within docs
    ....
    # iterate words in a given location within a document
    foreach my $str ($line =~ /[[:alpha:]]+/g) {
        push @{ $doc{$docno} }, $location;
        push @{ $wordlist{$str} }, $doc{$docno};
    }
}
My code to print the data is:
foreach my $str (reverse sort { $wordlist{$a} <=> $wordlist{$b} } keys %wordlist) {
    printf $fo "%-15s %-15s \n", $str, "@{ $wordlist{$str} }";
}
However, the result is:
+----------+------------------------------------------------------+
| Word     | Location                                             |
+----------+------------------------------------------------------+
| 'word1'  | ARRAY(0x66d4508) ARRAY(0x66d4508) ARRAY(0x66d4508)   |
+----------+------------------------------------------------------+
Where did I go wrong?
Edit:
I tried changing the printing code to:
foreach my $str (reverse sort { $wordlist{$a} <=> $wordlist{$b} } keys %wordlist) {
    printf "%-15s", $str;
    @arr = @{ $wordlist{$str} };
    foreach $arr (@arr)
    {
        print "@{ $arr }: , ";
    }
    print "\n";
}
But the result is:
word101 title title text text text text text text ...
I can't figure out how to print the document number alongside the location within that document.
Upvotes: 1
Views: 61
Reputation: 11813
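Your data structure throws away the information you're after. $doc{$docno} holds only a list of locations, so what ends up in %wordlist is a reference to that list (hence the ARRAY(0x...) output), and the document number is never stored with it.
Store the document number and the location together instead: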
while (my $line = <$fh>) {
    # determine doc no and location within docs
    ....
    # iterate words in a given location within a document
    foreach my $str ($line =~ /[[:alpha:]]+/g) {
        # one record per occurrence; postfix dereference (->@*) needs Perl 5.24+
        push $wordlist{$str}->@*, {
            docno    => $docno,
            location => $location
        };
    }
}
This makes the job of printing out your data structure trivial.
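For instance, the printing step could look roughly like the sketch below. The sample %wordlist contents and the format_entry helper are made up for illustration and are not from the question. I'm also assuming that the intended sort order is "most occurrences first", which the original comparator (it compares array references numerically) does not actually give you. The sketch uses the older @{ ... } dereference syntax so it also runs on Perls before 5.24.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape built by the loop above:
# word => [ { docno => ..., location => ... }, ... ]
my %wordlist = (
    apple  => [
        { docno => 'doc1', location => 'title' },
        { docno => 'doc4', location => 'text'  },
        { docno => 'doc7', location => 'title' },
        { docno => 'doc7', location => 'text'  },
    ],
    banana => [
        { docno => 'doc2', location => 'text' },
    ],
);

# Hypothetical helper: group a word's records by document, keeping
# first-seen document order, e.g. "doc1 title, doc7 title text".
sub format_entry {
    my ($entries) = @_;
    my (%seen, @order, %locations);
    for my $entry (@$entries) {
        my ($docno, $location) = @{$entry}{qw(docno location)};
        # Remember each document the first time it shows up.
        push @order, $docno unless $locations{$docno};
        # List a location once per document, even if the word repeats there.
        push @{ $locations{$docno} }, $location
            unless $seen{$docno}{$location}++;
    }
    return join ', ', map { "$_ @{ $locations{$_} }" } @order;
}

# Assumed ordering: most occurrences first, ties broken alphabetically.
for my $word (sort { @{ $wordlist{$b} } <=> @{ $wordlist{$a} } or $a cmp $b }
              keys %wordlist) {
    printf "%-15s %s\n", $word, format_entry($wordlist{$word});
}
```

With this sample data, the apple line lists "doc1 title, doc4 text, doc7 title text" and the banana line lists "doc2 text". If you would rather have one row per occurrence, drop the helper and print $_->{docno} and $_->{location} directly for each record.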
Upvotes: 1