Reputation: 77
I have made a little project that takes .cs files, reads them, and returns the most frequent word in the file. However, right now it reports that the most common word is a comma. How can I make splitting the string ignore commas?
For example, I have this string:
, . ? a a, b cdef cfed, abef abef abef,
Right now it returns that the most common word is 'abef' and that it occurred 2 times (the program doesn't count the third abef, the one with a comma at the end).
Another example:
, . ? a a, b cdef cfed, abef abef abef, , ,
This currently returns that the most common word is a comma ',' and that it occurred 3 times, but I want my program to ignore commas and count only words.
using System;
using System.Collections.Generic;
using System.IO;
using System.Windows.Forms;

namespace WindowsFormsApp8
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private async void button1_Click(object sender, EventArgs e)
        {
            // File dialog limited to a single .cs or .txt file.
            using (OpenFileDialog ofd = new OpenFileDialog() { Filter = "Text Documents |*.cs;*.txt", ValidateNames = true, Multiselect = false })
            {
                if (ofd.ShowDialog() == DialogResult.OK) // a file was selected
                {
                    using (StreamReader sr = new StreamReader(ofd.FileName))
                    {
                        richTextBox1.Text = await sr.ReadToEndAsync(); // load the whole file into the text box
                    }
                }
            }
        }

        private void button2_Click(object sender, EventArgs e)
        {
            // Splits on spaces only, so "abef," and "," are treated as words.
            string[] userText = richTextBox1.Text.ToLower().Split(' ');
            var frequencies = new Dictionary<string, int>(); // word -> number of occurrences
            string highestWord = null; // most frequent word seen so far
            int highestFreq = 0;       // its count
            foreach (string word in userText)
            {
                int freq; // stays 0 if the word has not been seen yet
                frequencies.TryGetValue(word, out freq);
                freq += 1; // count this occurrence
                if (freq > highestFreq)
                {
                    highestFreq = freq;
                    highestWord = word;
                }
                frequencies[word] = freq; // store the updated count for this word
            }
            MessageBox.Show("the most occurring word is: " + highestWord + ", it occurred " + highestFreq + " times");
        }
    }
}
Upvotes: 0
Views: 460
Reputation: 216273
Split can take an array of chars to split on, so you can split on both space and comma. Then remove the empty entries with the appropriate StringSplitOptions value:
string[] userText = richTextBox1.Text.ToLower().Split(new char[] { ' ', ','}, StringSplitOptions.RemoveEmptyEntries );
You can also use LINQ to calculate the frequency of each word, with code like this:
var g = userText.GroupBy(x => x)
                .Select(z => new { word = z.Key, count = z.Count() })
                .ToList();
string mostUsed = g.OrderByDescending(x => x.count)
                   .Select(x => x.word)
                   .FirstOrDefault();
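To put the two pieces above together, here is a minimal sketch that runs outside the form. The class name WordCounter, the method name MostFrequent, and the extra separators ('.', '?', and whitespace) are assumptions added for illustration; the answer itself only splits on space and comma.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Hypothetical helper: returns the most frequent word and its count.
    // For empty input it returns the default pair (null key, count 0).
    public static KeyValuePair<string, int> MostFrequent(string text)
    {
        // Assumption: '.', '?' and whitespace are separators too, not words.
        char[] separators = { ' ', ',', '.', '?', '\r', '\n', '\t' };

        // RemoveEmptyEntries drops the empty strings left between
        // adjacent separators such as ", , ,".
        string[] words = text.ToLower()
                             .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        // Group identical words, count each group, keep the largest.
        return words.GroupBy(w => w)
                    .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                    .OrderByDescending(p => p.Value)
                    .FirstOrDefault();
    }
}
```

Calling WordCounter.MostFrequent on the second example string from the question gives abef with a count of 3, because the trailing comma no longer sticks to the last abef.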
Upvotes: 6
Reputation: 23078
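Another option is to make the word extraction easier to extend by using regular expressions, more specifically Regex.Matches, which returns every run of word characters:
string input = ", . ? a a, b cdef cfed, abef abef abef, , ,";
string[] result = Regex.Matches(input, @"\w+").Cast<Match>().Select(m => m.Value).ToArray(); // needs System.Linq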
If ? is a valid word, then the regex could be @"\w+|\?".
So my recommendation is to use a regex even if the Split method is enough for now, since it is more powerful and can easily accommodate later changes.
As a bonus, it is nice to learn about regular expressions.
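As a rough illustration of the regex approach, the sketch below counts every match of a pattern rather than splitting on separators. The names RegexWordCounter and CountWords, and the optional pattern parameter, are hypothetical and not part of the answer; the default pattern assumes a word is a run of \w characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexWordCounter
{
    // Hypothetical helper: counts how often each regex match occurs in the text.
    public static Dictionary<string, int> CountWords(string text, string pattern = @"\w+")
    {
        var counts = new Dictionary<string, int>();

        // Each Match is one word; punctuation between matches is skipped.
        foreach (Match m in Regex.Matches(text.ToLower(), pattern))
        {
            int n; // stays 0 when the word is not in the dictionary yet
            counts.TryGetValue(m.Value, out n);
            counts[m.Value] = n + 1;
        }
        return counts;
    }
}
```

Passing @"\w+|\?" as the pattern makes ? count as a word as well, which is the kind of later change the regex makes cheap.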
Upvotes: 2
Reputation: 1200
You could replace the commas with an empty string, then run the output through your algorithm.
string original = ", . ? a a, b cdef cfed, abef abef abef,";
string noCommas = original.Replace(",", string.Empty);
Reference: https://msdn.microsoft.com/en-us/library/fk49wtc1(v=vs.110).aspx
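One detail worth noting with this approach: once a standalone comma is removed, the spaces around it become adjacent, so a plain Split(' ') yields empty strings. A minimal sketch that pairs Replace with StringSplitOptions.RemoveEmptyEntries is shown below; the names CommaStripper and SplitIgnoringCommas are made up for the example.

```csharp
using System;

public static class CommaStripper
{
    // Hypothetical helper: removes commas, then splits on spaces.
    public static string[] SplitIgnoringCommas(string text)
    {
        // "abef," becomes "abef"; a lone "," disappears entirely.
        string noCommas = text.Replace(",", string.Empty);

        // RemoveEmptyEntries discards the empty strings produced by
        // the consecutive spaces that the removed commas leave behind.
        return noCommas.ToLower()
                       .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

The resulting array can be fed to the counting loop from the question unchanged. Other punctuation such as . and ? is still treated as a word here, since only commas are stripped.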
Upvotes: 3