Reputation: 367
I need to create a method that reads an HTML file and then displays the number of occurrences of each word in a given list.
For example: String[] words = {"happy", "nice", "good"};
The word happy was used 7 times. The word nice was used 1 times. The word good was used 2 times.
This is what I did:
public static void ReadWriteDisplay() {
    Path in = Paths.get("E:\\TextToHTML.html");
    Path out = Paths.get("E:\\HTMLToText.txt");
    String s = "";
    String str = "";
    try {
        InputStream input = new BufferedInputStream(Files.newInputStream(in));
        BufferedReader reader = new BufferedReader(new InputStreamReader(input));
        OutputStream output = new BufferedOutputStream(Files.newOutputStream(out, CREATE, WRITE, TRUNCATE_EXISTING));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(output));
        s = reader.readLine();
        while (s != null) {
            str += s;
            writer.write(s);
            writer.newLine();
            s = reader.readLine();
        }
        reader.close();
        writer.close();
        String a[] = str.split(" ");
        System.out.println("str: " + str);
        String[] positive = {"happy", "nice", "good", "joy", "love"};
        int[] count = {0, 0, 0, 0, 0};
        for (int i = 0; i < a.length; i++) {
            if (positive[0].equalsIgnoreCase(a[i]))
                count[0]++;
            if (positive[1].equalsIgnoreCase(a[i]))
                count[1]++;
            if (positive[2].equalsIgnoreCase(a[i]))
                count[2]++;
            if (positive[3].equalsIgnoreCase(a[i]))
                count[3]++;
            if (positive[4].equalsIgnoreCase(a[i]))
                count[4]++;
        }
        for (int x = 0; x < 5; x++) {
            System.out.println("The word " + positive[x] + " was used " + count[x] + " times.");
        }
    } catch (Exception e) {
        System.err.println("Message: " + e);
    }
}
My method runs, but it does not report the correct number of occurrences. The reason is that some words in the HTML are enclosed in tags, so a token such as <>Hello<> is stored in my string array instead of the word Hello.
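To illustrate the problem in isolation, here is a minimal sketch that splits a shortened fragment of the article body on single spaces, the same way str.split(" ") does above. The class name SplitDemo and the shortened sample string are made up for this illustration.

```java
import java.util.Arrays;

final class SplitDemo {
    // Splits on a single space, exactly like str.split(" ") in the method above.
    static String[] tokens(String html) {
        return html.split(" ");
    }

    public static void main(String[] args) {
        String sample = "<p>happy</p> <p>nice nice</p> <p>good good good</p>";
        // prints [<p>happy</p>, <p>nice, nice</p>, <p>good, good, good</p>]
        System.out.println(Arrays.toString(tokens(sample)));
    }
}
```

Only the middle good is a bare word, so equalsIgnoreCase matches it once; every other token still has a tag attached.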
Here is the sample output:
str: <!DOCTYPE html><html lang="en"><head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <meta http-equiv="content-language" content="en" /> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="google-site-verification" content="rUp8isOBygjhxPJ2qyy6QtBi9vWRFhIboMXucJsCtrE" /> <title>JustPaste.it - Share Text & Images the Easy Way</title> <link rel="preload" href="/static/img/jp_logo_1_en_v4.png" as="image" /> <meta name="robots" content="noindex, nofollow" /> <meta name="googlebot" content="noindex, nofollow" /> <link rel="preload" href="/build/global.395f53d0.css" as="style" /> <link rel="stylesheet" type="text/css" href="/build/global.395f53d0.css" /> <link rel="shortcut icon" href="/static/other/fav.ico" /> <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries --> <!-- WARNING: Respond.js doesn't work if you view the page via file:// --> <!--[if lt IE 9]> <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script> <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> <![endif]--> <script> window.article = 
{"id":42017684,"url":"https:\/\/justpaste.it\/6fn9m","shortUrl":"https:\/\/jpst.it\/2wiek","pdfUrl":"https:\/\/justpaste.it\/6fn9m\/pdf","qrCodeData":"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFcAAABXCAIAAAD+qk47AAAACXBIWXMAAA7EAAAOxAGVKw4bAAACCklEQVR4nO2by27DMAwEx0X\/\/5fTAwFdaNB8SEmB7BzjSDEWy4ikpOv1evH1\/Hz6Bf4FUgGkgiEVQCoYv\/6j67omM65FJzOPX6HWKD9PaebSj8oLIBWMm4hYlBIq79Jg+Pqyd3vpR4dvuJAXQCoYUUQsAi9lPOlt74dnloZzbygvgFQwUhExpJft9EKjh7wAUsF4R0QE+Bh5g\/898gJIBSMVEUNzDjOiDMN55AWQCkYUEcOWTqlrtL18KCEvgFQwbiJie7qSMXkpELa\/obwAUsFI7UcEpXHw397bmMh0cXtJVzBKXgCpYFyB3xYlT\/Ye3bzZ7q264EflBZAKRmqHLmPyYJR\/5IeXEqrt8SgvgFQwojoiY9feEpN5VCLo4maQF0AqGLVzTcM\/50UpEdpVj+sUxwNSAao7dJk6erHrhN65umYhL4BUMGoRUTJ56TsBw\/UoM0peAKlg1CrrRamgLnEu6VLW9IBUgLj7Ouz\/DJePHr16RF4AqWA096yDc92lCXs3hjzDyJIXQCoYB+\/Q9Q4vDS9cBPOojnhAKsDRO3R+nl3dp94uhrKmB6QCHL1Dlznp1GsWbUdeAKlgvOPGUK8juqt5mymx5QWQCsbBiCglS5+9KCEvgFQwDt6hO3djdHtfV14AqWAcvEO36B1M6mVNvQpFXgCpYNzs0H0h8gJIBUMqgFQwpALAH\/JvmLtnlWjnAAAAAElFTkSuQmCC"}; window.statsUrl = 'https\u003A\/\/stats.justpaste.it'; window.viewKey = 'x6ER'; window.barOptions = 
{"isLoggedIn":false,"hasPublicProfile":false,"displayOwnership":false,"isArticleOwner":false,"isPasswordProtected":false,"isCaptchaRequired":null,"isCaptchaEntered":false,"captchaSettings":null,"premiumUserData":null,"isPrivate":false,"isExpired":false,"expireAfterRead":false,"isShared":false,"defaultAvatar":"\/static\/img\/avatar60.jpg","createdText":"6h","showLastEdit":false,"modifiedText":"6h","isInTrash":false,"viewsText":"2","favouritesCount":0,"onlineText":"1","getFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article\/42017684","addFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article","removeFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article-delete\/42017684","apiShowArticleDynamicUrl":"\/api\/v1\/article-dynamic","voteUrl":"\/api\/account\/v1\/vote","contentLang":"en","positiveVotes":0,"negativeVotes":0,"currentVote":"empty","linkSharingUrl":null,"linkSharingSecret":null}; </script> <script src="/build/runtime.a1e5a72a.js" async></script> <script src="/build/1676.2c557867.js" async></script> <script src="/build/8452.a9a1e0c5.js" async></script> <script src="/build/5936.ad26e56d.js" async></script> <script src="/build/9412.4a605741.js" async></script> <script src="/build/showarticlewidget.3bbca334.js" async></script> </head><body marginwidth="0" dir="ltr" marginheight="0"><!-- Static navbar --><div class="navbar navbar-default navbar-static-top mainTableTopMiddle" role="navigation"> <div class="container"> <div class="navbar-header pull-left"> <a href="/"><img src="/static/img/jp_logo_1_en_v4.png" width="186px" height="54px" alt="JustPaste.it" /></a> </div> <div class="navbar-header pull-left"> <div class="nav navbar-nav mainTableTopMiddleRight hidden-xs hidden-sm"> <img src="/static/img/jp_logo_2_en_v5.png" width="390px" height="54px" /> </div> </div> <div class="navbar-header pull-right" style="padding-top:8px"> <div id="mainPanelButtons"></div> </div> 
</div><!--/.nav-collapse --></div><div id="headContainer" class="container" style="max-width: 960px"> <div class="row"> <div class="col-md-12"> <div id="mainTableContent"> <div style="max-width: 960px; vertical-align: top"> <div id="showArticleWidget"><div class="showArticleWidgetPlaceholder"></div></div> <div id="articleContent"> <p>happy</p> <p>nice nice</p> <p>good good good</p> <p>joy Joy joy Joy joy</p> <p>Love love Love love Love</p> </div> <div id="showArticleBottomWidget"><div class="articleBottomWidgetPlaceholder"></div></div> <span style="visibility:hidden" class="glyphicon glyphicon-link"></span></div> </div> </div> </div> <!-- /row --></div> <!-- /container --><div id="footer" style="min-height: 30px;"> <div class="container" style="vertical-align: middle"> <div class="col-md-3 col-xs-5 col-sm-4 text-muted" style="font-size: 95%;" align="left"> © 2021 <span class="hidden-xs">justpaste.it</span> </div> <div class="col-md-9 col-xs-7 col-sm-8 text-muted" align="right"> <ul class="list-inline basePageFooterList"> <li class="hidden-xs"> <a href="/login">Account</a> </li> <li class="hidden-xs"> <a href="/terms">Terms</a> </li> <li class="hidden-xs"> <a href="/privacypolicy">Privacy</a> </li> <li class="hidden-xs"> <a href="/cookies">Cookies</a> </li> <li> <a href="/u/justpasteit">Blog</a> </li> <li> <a href="/about">About</a> </li> </ul> </div> </div></div> <script> window.mainPanelOptions = { addArticleUrl: '/', loginUrl: '/login', logoutUrl: '/logout', favouriteArticlesUrl: '/account/favourite', subscribedArticlesUrl: '/account/subscribed', sharedArticlesUrl: '/account/shared', manageAccountUrl: '/account/manage', messagesUrl: '/account/messages', articlesStatsUrl: '/account/articles-stats', premiumUrl: '/premium/subscription', unreadMessagesUrl: 'https://msg.justpaste.it/api/v1/conversation/unread', profileSettings: '/account/settings', isLoggedIn: false, userEmail: null, userPermalink: null, userProfileIsPublic: false, userProfileLink: null }; </script> 
<script src="/build/mainpanelwidget.80530742.js" async></script> </body></html>
The word happy was used 0 times.
The word nice was used 0 times.
The word good was used 1 times.
The word joy was used 3 times.
The word love was used 3 times.
How do I properly split the text or count the number of occurrences? Thank you!
Upvotes: 0
Views: 248
Reputation: 125
You can simply use jsoup, a Java HTML parser library, to extract all the text from the HTML structure.
Download the jar file from: https://jsoup.org/download
The code below counts the occurrences of each word:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

static void countOccurance(String htmlStructure) {
    String[] positive = { "happy", "nice", "good", "joy", "love" };
    // Parse the markup and keep only the text of <body>, split on whitespace
    Document document = Jsoup.parse(htmlStructure);
    String[] text = document.body().text().split("\\s+");
    for (String word : positive) {
        int wordCount = countWord(text, word);
        System.out.println("The word " + word + " was used " + wordCount + " times.");
    }
}

// Counts the tokens that equal wordToFind, ignoring case
static int countWord(String[] documentText, String wordToFind) {
    int count = 0;
    for (int i = 0; i < documentText.length; i++) {
        if (wordToFind.equalsIgnoreCase(documentText[i]))
            count++;
    }
    return count;
}
Upvotes: 2
Reputation: 140
This will help you remove special characters. It keeps only letters; for example, <>Hello<> is replaced with Hello.
String alphaOnly = input.replaceAll("[^a-zA-Z]+","");
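Note that when this replaceAll call is applied to a real token such as <p>happy</p>, the letters of the tag names survive (the result is phappyp), and when applied to the whole document it also deletes the spaces between words. One possible workaround is to drop the tags first and then split on anything that is not a letter. The following is a rough sketch using only the standard library; the class name TagStripCounter and the method countWords are hypothetical, and the pattern <[^>]*> is a crude approximation that does not treat script contents, comments, or entities the way a real HTML parser would.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class TagStripCounter {
    // Counts case-insensitive occurrences of each target word outside of tags.
    static Map<String, Integer> countWords(String html, String[] targets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : targets) {
            counts.put(t.toLowerCase(Locale.ROOT), 0);
        }
        // Replace every <...> tag with a space so neighbouring words stay separated.
        String text = html.replaceAll("<[^>]*>", " ");
        // Split on runs of non-letters; tokens that are not targets are ignored.
        for (String token : text.split("[^a-zA-Z]+")) {
            counts.computeIfPresent(token.toLowerCase(Locale.ROOT), (k, v) -> v + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String html = "<div id=\"articleContent\"> <p>happy</p> <p>nice nice</p> "
                + "<p>good good good</p> <p>joy Joy joy Joy joy</p> "
                + "<p>Love love Love love Love</p> </div>";
        String[] positive = {"happy", "nice", "good", "joy", "love"};
        countWords(html, positive).forEach((word, n) ->
                System.out.println("The word " + word + " was used " + n + " times."));
    }
}
```

On the article fragment above this reports 1, 2, 3, 5 and 5 for happy, nice, good, joy and love. On a full page, any of these words that appears inside a script block would be counted as well, which is where a parser-based approach has the advantage.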
Upvotes: 0