Reputation: 48916
I have the following text:
Grier et al. (1983) reported father and 2 sons with typical Aarskog
syndrome, including short stature, hypertelorism, and shawl scrotum.
They tabulated the findings in 82 previous cases. X-linked recessive
inheritance has repeatedly been suggested (see 305400). The family
reported by Welch (1974) had affected males in 3 consecutive
generations. Thus, there is either genetic heterogeneity or this is an
autosomal dominant with strong sex-influence and possibly ascertainment
bias resulting from use of the shawl scrotum as a main criterion.
Stretchable skin was present in the cases of Grier et al. (1983).
I'm trying to return the list of words in the text above.
I did something as follows:
input_file.read.downcase.scan(/\b[a-z]\b/) {|word| frequency[word] = frequency[word] + 1}
I get the letters (i.e. a, b, c, ..., z) and their frequency in the document, and not the words. Why is that? And how can I get the words instead of only standalone letters?
Upvotes: 1
Views: 71
Reputation: 160551
I'd do it like this:
text = 'Foo. (1983). Bar baz foo bar.'
text.downcase
# => "foo. (1983). bar baz foo bar."
downcase folds the text to lower-case to make it easy to find matches for the words regardless of case.
text.downcase.gsub(/[^a-z ]+/i, '')
# => "foo bar baz foo bar"
gsub(/[^a-z ]+/i, '') removes characters that aren't part of words, like punctuation and numbers.
text.downcase.gsub(/[^a-z ]+/i, '').split
# => ["foo", "bar", "baz", "foo", "bar"]
split breaks the string into "words" delimited by whitespace.
text.downcase.gsub(/[^a-z ]+/i, '').split.each_with_object(Hash.new{ |h,k| h[k] = 0}){ |w, h| h[w] += 1 }
# => {"foo"=>2, "bar"=>2, "baz"=>1}
each_with_object(Hash.new{ |h,k| h[k] = 0}){ |w, h| h[w] += 1 } is how to walk through an array and count the frequency of elements. Hash.new{ |h,k| h[k] = 0} is how to define a hash that will automatically create 0 values for keys that don't exist.
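For comparison, here is a minimal sketch of two shorter ways to build the same counts. It assumes Ruby 2.7 or newer for Enumerable#tally; the variable names are only illustrative.

```ruby
words = %w[foo bar baz foo bar]

# Hash.new(0) returns 0 for a missing key without storing it.
# That is enough for counting, because += assigns the key anyway.
counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }

# Enumerable#tally (Ruby 2.7+) builds the same element => count hash in one call.
tallied = words.tally

puts counts == tallied # prints true
```

The block form of Hash.new used above is still the safer habit when the default is a mutable object such as an array; for integer counters the plain Hash.new(0) form behaves the same.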
With all that in mind:
text = 'Grier et al. (1983) reported father and 2 sons with typical Aarskog syndrome, including short stature, hypertelorism, and shawl scrotum. They tabulated the findings in 82 previous cases. X-linked recessive inheritance has repeatedly been suggested (see 305400). The family reported by Welch (1974) had affected males in 3 consecutive generations. Thus, there is either genetic heterogeneity or this is an autosomal dominant with strong sex-influence and possibly ascertainment bias resulting from use of the shawl scrotum as a main criterion. Stretchable skin was present in the cases of Grier et al. (1983).'
text.downcase
.gsub(/[^a-z ]+/i, '')
.split
.each_with_object(Hash.new{ |h,k| h[k] = 0}){ |w, h| h[w] += 1 }
Which results in:
# => {"grier"=>2, "et"=>2, "al"=>2, "reported"=>2, "father"=>1, "and"=>3, "sons"=>1, "with"=>2, "typical"=>1, "aarskog"=>1, "syndrome"=>1, "including"=>1, "short"=>1, "stature"=>1, "hypertelorism"=>1, "shawl"=>2, "scrotum"=>2, "they"=>1, "tabulated"=>1, "the"=>4, "findings"=>1, "in"=>3, "previous"=>1, "cases"=>2, "xlinked"=>1, "recessive"=>1, "inheritance"=>1, "has"=>1, "repeatedly"=>1, "been"=>1, "suggested"=>1, "see"=>1, "family"=>1, "by"=>1, "welch"=>1, "had"=>1, "affected"=>1, "males"=>1,...
If you insist on using a regex and scan:
text.downcase
.scan(/\b [a-z]+ \b/x)
.each_with_object(Hash.new{ |h,k| h[k] = 0}){ |w, h| h[w] += 1 }
# => {"grier"=>2, "et"=>2, "al"=>2, "reported"=>2, "father"=>1, "and"=>3, "sons"=>1, "with"=>2, "typical"=>1, "aarskog"=>1, "syndrome"=>1, "including"=>1, "short"=>1, "stature"=>1, "hypertelorism"=>1, "shawl"=>2, "scrotum"=>2, "they"=>1, "tabulated"=>1, "the"=>4, "findings"=>1, "in"=>3, "previous"=>1, "cases"=>2, "x"=>1, "linked"=>1, "recessive"=>1, "inheritance"=>1, "has"=>1, "repeatedly"=>1, "been"=>1, "suggested"=>1, "see"=>1, "family"=>1, "by"=>1, "welch"=>1, "had"=>1, "affected"=>1, ...
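The results differ only in how hyphenated words are handled: gsub deletes the hyphen, so "x-linked" becomes "xlinked", while scan returns "x" and "linked" separately. Beyond that, the difference really is that gsub().split is faster than scan(/\b [a-z]+ \b/x).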
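If you want to check the speed difference on your own input, a rough sketch with the standard Benchmark library could look like the following. The sample text, the repetition count, and the count helper are assumptions made for illustration, and timings will vary with the Ruby version and the input.

```ruby
require 'benchmark'

# Sample input repeated to make the timings measurable; the size is arbitrary.
text = ('Foo. (1983). Bar baz foo bar. ' * 10_000).downcase

# Hypothetical helper: counts how often each word occurs in an array.
count = ->(words) { words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 } }

Benchmark.bm(12) do |x|
  # Strip everything except letters and spaces, then split on whitespace.
  x.report('gsub.split') { count.call(text.gsub(/[^a-z ]+/, '').split) }
  # Pull out runs of letters between word boundaries.
  x.report('scan')       { count.call(text.scan(/\b[a-z]+\b/)) }
end
```

Both approaches return the same counts for this sample because it contains no hyphenated words.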
Upvotes: 1
Reputation: 319
http://rubular.com is a great resource.
\b[a-z]\b matches any single letter between two word boundaries, which is why you only get standalone letters.
If you would like to allow for multiple characters, use this: \b[a-z]+\b
That matches one or more letters between two word boundaries.
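A small sketch of the two patterns side by side; the sample sentence is made up for illustration.

```ruby
text = 'a father with an x-linked trait'

# Exactly one letter between word boundaries: only standalone letters match.
single = text.scan(/\b[a-z]\b/)
# => ["a", "x"]

# One or more letters between word boundaries: whole words match.
words = text.scan(/\b[a-z]+\b/)
# => ["a", "father", "with", "an", "x", "linked", "trait"]
```

The hyphen counts as a word boundary, so "x-linked" comes back as "x" and "linked".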
Upvotes: 3