Reputation: 1
I am working on a Java project, a fake news detection application. The dataset contains two columns: Text (the news article) and Label (0 = fake, 1 = genuine). This data is converted to a JSON file. In Java, I used regex to replace all stop words with spaces (" "). Then I worked on vectorization in Java. I faced issues with the built-in vectorization techniques in Weka and Deeplearning4j, so I am now using the "StringToWordVector" filter to vectorize the text. Below is the code for the .java files in my application.
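For context, my stop-word removal step looks roughly like this (the word list shown here is a small illustrative subset, not my full list):

```java
import java.util.regex.Pattern;

public class StopWordRemover {
    // Small illustrative stop-word list; a real application would use a much fuller list
    private static final Pattern STOP_WORDS =
        Pattern.compile("\\b(the|a|an|is|are|was|were|of|and|or|to|in)\\b",
                        Pattern.CASE_INSENSITIVE);

    // Replace each stop word with a space, then collapse runs of whitespace
    public static String removeStopWords(String text) {
        return STOP_WORDS.matcher(text).replaceAll(" ")
                         .replaceAll("\\s+", " ")
                         .trim();
    }
}
```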
DataProcessor.java
package fnd;
import com.fasterxml.jackson.databind.ObjectMapper;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
public class DataProcessor {
    public static void main(String[] args) {
        try {
            // Specify the path to your JSON file containing news data
            String jsonFilePath = "src/main/resources/fnd_output.json";
            // Create ObjectMapper instance to read JSON
            ObjectMapper objectMapper = new ObjectMapper();
            // Deserialize JSON array into an array of News objects
            News[] newsArray = objectMapper.readValue(new File(jsonFilePath), News[].class);
            // Prepare attributes for the Instances
            ArrayList<Attribute> attributes = new ArrayList<>();
            attributes.add(new Attribute("text", (ArrayList<String>) null)); // Text attribute as string
            // Define nominal values for the label attribute
            ArrayList<String> labelValues = new ArrayList<>();
            labelValues.add("positive");
            labelValues.add("negative");
            Attribute labelAttribute = new Attribute("label", labelValues); // Label attribute as nominal
            attributes.add(labelAttribute);
            // Create an empty Instances object
            Instances instances = new Instances("TextInstances", attributes, 0);
            // Set the index of the class attribute (label attribute)
            instances.setClassIndex(attributes.size() - 1);
            // Process each News object and add to Instances
            for (News news : newsArray) {
                String processedText = TextPreprocessor.preprocessText(news.getText());
                // Vectorize the processed text
                Instances vectorizedInstance = TextVectorization.vectorizeText(processedText);
                // Create a new Instance
                Instance instance = new DenseInstance(attributes.size());
                // Set the dataset for the instance
                instance.setDataset(instances);
                // Handle text attribute (assuming it's a string attribute)
                Attribute textAttr = attributes.get(0);
                if (textAttr.isString()) {
                    instance.setValue(textAttr, vectorizedInstance.instance(0).stringValue(0));
                } else {
                    System.err.println("Text attribute is not a string attribute.");
                }
                // Handle label attribute (assuming it's a nominal attribute)
                Attribute labelAttr = labelAttribute;
                if (labelAttr.isNominal()) {
                    instance.setValue(labelAttr, news.getLabel());
                } else {
                    System.err.println("Label attribute is not a nominal attribute.");
                }
                // Add the instance to Instances
                instances.add(instance);
            }
            // Output instances to ARFF file
            ArffSaver arffSaver = new ArffSaver();
            arffSaver.setInstances(instances);
            arffSaver.setFile(new File("vectorized_text_with_labels.arff"));
            arffSaver.writeBatch();
            System.out.println("Text vectorization complete with labels. Saved as vectorized_text_with_labels.arff");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
TextVectorization.java
package fnd;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.DenseInstance;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;
import weka.core.Instance;
public class TextVectorization {
    // Method to perform text vectorization (convert string to word vector)
    public static Instances vectorizeText(String text) throws Exception {
        // Create ArrayList to hold attributes
        ArrayList<Attribute> attributes = new ArrayList<>();
        // Create a single attribute named "text"
        Attribute textAttribute = new Attribute("text", (ArrayList<String>) null);
        attributes.add(textAttribute);
        // Create Instances object with the specified attribute
        Instances instances = new Instances("TextInstances", attributes, 0);
        instances.setClass(textAttribute); // Set the class attribute to "text"
        // Create a new Instance with the provided text
        Instance instance = new DenseInstance(instances.numAttributes());
        instance.setValue(textAttribute, text);
        instances.add(instance);
        // Apply StringToWordVector filter to vectorize the text
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(instances);
        Instances vectorizedData = Filter.useFilter(instances, filter);
        return vectorizedData;
    }
    public static void saveInstancesToArff(Instances instances, String filename) throws IOException {
        ArffSaver arffSaver = new ArffSaver();
        arffSaver.setInstances(instances);
        arffSaver.setFile(new File(filename));
        arffSaver.writeBatch();
    }
}
TextPreprocessor.java
package fnd;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TextPreprocessor {
    private static final Pattern URL_PATTERN = Pattern.compile("http[s]?://\\S+|www\\.\\S+");
    private static final Pattern HTML_TAG_PATTERN = Pattern.compile("<[^>]+>");
    public static String preprocessText(String text) {
        if (text == null || text.isEmpty()) {
            return "";
        }
        // Convert text to lowercase
        text = text.toLowerCase();
        // Remove URLs and HTML tags
        text = removeUrlsAndHtmlTags(text);
        // Keep only letters and whitespace (drops digits and other special characters)
        text = removeSpecialCharacters(text);
        return text;
    }
    private static String removeUrlsAndHtmlTags(String text) {
        Matcher urlMatcher = URL_PATTERN.matcher(text);
        text = urlMatcher.replaceAll("");
        Matcher htmlTagMatcher = HTML_TAG_PATTERN.matcher(text);
        text = htmlTagMatcher.replaceAll("");
        return text;
    }
    private static String removeSpecialCharacters(String text) {
        StringBuilder processedText = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            if (Character.isLetter(ch) || Character.isWhitespace(ch)) {
                processedText.append(ch);
            }
        }
        return processedText.toString();
    }
}
Details about the error, as far as I can tell:
java.lang.IllegalArgumentException: Attribute isn't nominal, string or date!
    at weka.core.AbstractInstance.stringValue(AbstractInstance.java:674)
    at weka.core.AbstractInstance.stringValue(AbstractInstance.java:644)
    at fnd.DataProcessor.main(DataProcessor.java:60)
Commenting out the following line makes the code run:
instance.setValue(textAttr, vectorizedInstance.instance(0).stringValue(0));
How can I vectorize the text and then feed the data into the model?
Upvotes: 0
Views: 77
Reputation: 2608
You are processing your data one document at a time, re-initializing the StringToWordVector filter on every call. The filter therefore produces a different bag of words each time, based only on the content of the single document you just pushed through, so the columns in the vectorized output relate to different words on each call. As a bare minimum fix, you need to add all your textual data to a single weka.core.Instances object and then apply the filter once.
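A minimal sketch of that fix, assuming Weka is on the classpath (the attribute names and the "0"/"1" label values mirror the question's dataset, but are illustrative):

```java
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchVectorizer {
    // Build ONE Instances object holding every document, then filter once,
    // so all rows share the same bag-of-words columns.
    public static Instances vectorizeAll(String[] texts, String[] labels) throws Exception {
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("text", (ArrayList<String>) null)); // string attribute
        ArrayList<String> labelValues = new ArrayList<>();
        labelValues.add("0"); // fake
        labelValues.add("1"); // genuine
        attributes.add(new Attribute("label", labelValues));

        Instances data = new Instances("TextInstances", attributes, texts.length);
        data.setClassIndex(1); // label is the class; StringToWordVector leaves it alone
        for (int i = 0; i < texts.length; i++) {
            Instance inst = new DenseInstance(2);
            inst.setDataset(data);
            inst.setValue(0, texts[i]);
            inst.setValue(1, labels[i]);
            data.add(inst);
        }

        // Initialize the filter ONCE on the full dataset and apply it ONCE
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}
```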
But...
Since you are planning on performing classification, you should use StringToWordVector in conjunction with the FilteredClassifier meta-classifier, with your choice of base classifier doing the actual classification. That way, subsequent predictions are preprocessed with the already-initialized StringToWordVector filter. In that scenario, your textual data should be the first attribute and the label associated with that text the second attribute (the class attribute).
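For example (NaiveBayes here is only a placeholder base classifier; substitute whichever Weka classifier you prefer, and again the attribute layout mirrors the question's data):

```java
import java.util.ArrayList;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class FilteredTextClassifier {
    // Build the raw dataset: text first, nominal label second (the class attribute)
    public static Instances buildRawDataset() {
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("text", (ArrayList<String>) null));
        ArrayList<String> labelValues = new ArrayList<>();
        labelValues.add("0");
        labelValues.add("1");
        attributes.add(new Attribute("label", labelValues));
        Instances data = new Instances("News", attributes, 0);
        data.setClassIndex(1);
        return data;
    }

    public static void addDocument(Instances data, String text, String label) {
        Instance inst = new DenseInstance(2);
        inst.setDataset(data);
        inst.setValue(0, text);
        inst.setValue(1, label);
        data.add(inst);
    }

    // Train on RAW text; the meta-classifier owns the StringToWordVector,
    // so later predictions are transformed with the same vocabulary.
    public static FilteredClassifier train(Instances rawData) throws Exception {
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new NaiveBayes()); // placeholder base classifier
        fc.buildClassifier(rawData);
        return fc;
    }
}
```

At prediction time you simply add the raw, unvectorized document to an Instances object with the same two-attribute structure and call classifyInstance on it; the embedded filter handles the transformation.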
Upvotes: 0