Reputation: 38
I have created a dictionary, and created code to read a txt file, and input each word from the file into the dictionary.
//Set up OpenFileDialog box, and prompt user to select file to open
DialogResult openFileResult;
OpenFileDialog file = new OpenFileDialog() ;
file.Filter = "txt files (*.txt)|*.txt";
openFileResult = file.ShowDialog();
if (openFileResult == DialogResult.OK)
{
//If user selected file successfully opened
//Reset form
this.Controls.Clear();
this.InitializeComponent();
//Read from file, split into array of words
Stream fs = file.OpenFile();
StreamReader reader;
reader = new StreamReader(fs);
string line = reader.ReadToEnd();
string[] words = line.Split(' ', '\n');
//Add each word and frequency to dictionary
foreach (string s in words)
{
AddToDictionary(s);
}
//Reset variables, and set-up chart
ResetVariables();
ChartInitialize();
foreach (string s in wordDictionary.Keys)
{
//Calculate statistics from dictionary
ComputeStatistics(s);
if (dpCount < 50)
{
AddToGraph(s);
}
}
//Print statistics
PrintStatistics();
}
And the AddToDictionary(s)
function is:
public void AddToDictionary(string s)
{
//Function to add string to dictionary
string wordLower = s.ToLower();
if (wordDictionary.ContainsKey(wordLower))
{
int wordCount = wordDictionary[wordLower];
wordDictionary[wordLower] = wordDictionary[wordLower] + 1;
}
else
{
wordDictionary.Add(wordLower, 1);
txtUnique.Text += wordLower + ", ";
}
}
The text file being read by this program is:
To be or not to be that is the question
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles
And by opposing end them To die to sleep
No more and by a sleep to say we end
The heartache and the thousand natural shocks
That flesh is heir to Tis a consummation
Devoutly to be wished To die to sleep
To sleep perchance to dream ay theres the rub
For in that sleep of death what dreams may come
When we **have** shuffled off this mortal coil
Must give us pause Theres the respect
That makes calamity of so long life
For who would bear the whips and scorns of time
The oppressors wrong the proud mans contumely
The pangs of despised love the laws delay
The insolence of office and the spurns
That patient merit of th unworthy takes
When he himself might his quietus make
With a bare bodkin Who would fardels bear
To grunt and sweat under a weary life
But that the dread of something after death
The undiscovered country from whose bourn
No traveller returns puzzles the will
And makes us rather bear those ills we **have**
Than fly to others that we know not of
Thus conscience does make cowards of us all
And thus the native hue of resolution
Is sicklied oer with the pale cast of thought
And enterprise of great pitch and moment
With this regard their currents turn awry
And lose the name of action Soft you now
The fair Ophelia Nymph in thy orisons
Be all my sins remembered
The problem I am encountering is that the word "have" is appearing twice in the dictionary. I know this doesn't happen with dictionaries, but for some reason it is appearing twice. Does anyone know why this would happen?
Upvotes: 1
Views: 101
Reputation: 117027
I would be inclined to change the following code:
//Read from file, split into array of words
Stream fs = file.OpenFile();
StreamReader reader;
reader = new StreamReader(fs);
string line = reader.ReadToEnd();
string[] words = line.Split(' ', '\n');
//Add each word and frequency to dictionary
foreach (string s in words)
{
AddToDictionary(s);
}
to this:
wordDictionary =
File
.ReadAllLines(file)
.SelectMany(x => x.Split(new [] { ' ', }, StringSplitOptions.RemoveEmptyEntries))
.Select(x => x.ToLower())
.GroupBy(x => x)
.ToDictionary(x => x.Key, x => x.Count());
This completely avoids the issues with line endings and also has the added advantage that it doesn't leave any undisposed streams lying around.
Upvotes: 0
Reputation: 100527
Splitting text into words is generally hard solution if you want to support multiple languages. Regular expressions are generally better in dealing with parsing than basic String.Split
.
I.e. in your case you are picking up variations of "new line" as part of a word, you also could be picking up things like non-breaking space,...
Following code will pick words better than your current .Split
, for more info - How do I split a phrase into words using Regex in C#
var words = Regex.Split(line, @"\W+").ToList();
Additionally you should make sure your dictionary is case insensitive like following (pick comparer based on your needs, there are culture-aware once too):
var dictionary = new Dictionary(StringComparer.OrdinalIgnoreCase);
Upvotes: 0
Reputation: 12270
If you run:
var sb = new StringBuilder();
sb.AppendLine("test which");
sb.AppendLine("is a test");
var words = sb.ToString().Split(' ', '\n').Distinct();
Inspecting words
in the debugger shows that some instances of "test" have acquired a \r
due to the two byte CRLF line terminator - which isn't treated by the split.
To fix, change your split to:
Split(new[] {" ", Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
Upvotes: 3