Reputation: 1
I have a CSV file in which each cell is a number that may itself contain a comma. I want to import it into my program so that the cells are split correctly. Is there any way to keep such numbers together?
For example, given the following file content:
column1 column2 column3 column4
10 10,000 100 1000
After splitting on commas, I get:
column1 column2 column3 column4 column5
10 10 000 100 1000
while the expected output would be:
column1 column2 column3 column4
10 10000 100 1000
Here is my code:
File file = new File("csv file");
try (BufferedReader bufRdr = new BufferedReader(new FileReader(file))) {
    while ((line = bufRdr.readLine()) != null) {
        // skip the first five lines and the line at index rowCount - 1
        if ((i < 5) || (i == myDimensionProcessor.rowCount - 1)) {
            i++;
        } else {
            String a = line.replaceAll("[/\\\\%|*^`]", "");
            c = a.replaceAll("--", "NA");
            i++;
            StringTokenizer st = new StringTokenizer(c, ",");
            while (st.hasMoreTokens()) {
                // get next token and store it in the array
                numbers[row][col] = st.nextToken();
                System.out.print(numbers[row][col]);
                col++;
            }
            col = 0;
            System.out.println();
            row++;
        }
    }
}
I am splitting the values of the CSV file into an array, but a value that contains a comma should not be split. How can I do that?
Alternatively, if the value does get split into separate columns, how can I join the pieces back into a single cell so that the remaining values shift back into the correct columns?
Upvotes: 0
Views: 1275
Reputation: 6289
CSV is way more complicated to process correctly than it looks. You are not going to end up with a reliable solution using regular expressions, not to mention that it is going to be extremely slow.
Just use a library dedicated to this, such as uniVocity-parsers, and save yourself the headache. Here's a simple example:
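As a small illustration of the problem, here is a sketch that counts the fields produced by a plain split on commas. It assumes the grouped number is quoted in the file; the class and method names are hypothetical and only serve the demonstration.

```java
class NaiveSplitDemo {
    // Counts the pieces produced by splitting on every comma,
    // ignoring quotation marks entirely (hypothetical helper).
    static int naiveFieldCount(String line) {
        return line.split(",").length;
    }

    public static void main(String[] args) {
        // Four logical fields, one of them a quoted number with a comma.
        String line = "10,\"10,000\",100,1000";
        // The quoted field is cut in two, so one field too many is reported.
        System.out.println(naiveFieldCount(line));
    }
}
```

A quote-aware parser would report four fields for the same line.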
CsvParserSettings settings = new CsvParserSettings(); //many options here, check the tutorial.
CsvParser parser = new CsvParser(settings);
List<String[]> allRows = parser.parseAll(new FileReader(file));
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
Upvotes: 0
Reputation: 1272
If your numbers with commas are enclosed in some kind of quotation marks, you should first replace numbers like "10,000" with 10000 and then do the tokenization. You can do it like this:
line = line.replaceAll("\"\\s*(\\d+),(\\d+)\\s*\"", "$1$2");
Or if you like to preserve the quotation marks:
line = line.replaceAll("(\")\\s*(\\d+),(\\d+)\\s*(\")", "$1$2$3$4");
You need to replace the \" in the code samples above with the appropriate quotation mark character.
If all the numbers in your CSV are in quotation marks, you can remove the commas inside numbers like this:
line = line.replaceAll("(\\d+),(\\d+)", "$1$2");
line = line.replace("\"", "");
This works as long as no CSV delimiter comma sits directly between two digits. When all your numbers are in quotation marks, this is safe to assume. Use the second line if you also want to remove the quotation marks.
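The quoted-number patterns above match a single comma, so a value like "1,234,567" would need more than one pass. Below is a minimal sketch of one possible variant that handles any number of thousands groups in a single pass. It assumes plain double quotes and groups of exactly three digits; the class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedNumberCleaner {
    // A quoted number with one or more thousands groups,
    // e.g. "10,000" or " 1,234,567 " (assumes plain double quotes).
    private static final Pattern QUOTED_NUMBER =
            Pattern.compile("\"\\s*(\\d{1,3}(?:,\\d{3})+)\\s*\"");

    // Replaces each quoted, comma-grouped number with its bare digits.
    static String stripGroupingCommas(String line) {
        Matcher m = QUOTED_NUMBER.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only digits remain, so the replacement has no special characters.
            String digits = m.group(1).replace(",", "");
            m.appendReplacement(out, digits);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints 10,10000,100,1234567
        System.out.println(stripGroupingCommas("10,\"10,000\",100,\"1,234,567\""));
    }
}
```

Unquoted values are left untouched, so the delimiter commas survive and the line can be tokenized afterwards.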
Upvotes: 1
Reputation: 191701
If the column containing commas is not quoted, then you'll need to extract each field manually.
I ran into this issue at work the other day, and here was my thought process.
I formed a regex starting with one group per column, so 4 columns separated by a comma and optional whitespace would be
(),\s*(),\s*(),\s*()$
Then I filled in the groups with the data I wanted to match
(\d+),\s*(.+),\s*(\d+),\s*(\d+)$
In this case every column except the second holds a plain number, so the second group grabs whatever is left, commas included. There may be a better way to write that, and I certainly would like to know, but I was in a time crunch when I figured it out. And to show it works, a Regex101.
Following the other answer, here is the relevant Java code:
line = line.replaceAll("(\\d+),\\s*(.+),\\s*(\\d+),\\s*(\\d+)$", "$1$2$3$4");
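Note that a replacement string of `$1$2$3$4` joins the four groups without any separator. If the goal is to get the fields back individually, one option is to read the groups from a `Matcher` instead. The following is only a sketch under the same assumption as the regex above (exactly four columns, only the second may contain commas); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedLayoutRow {
    // Same pattern as above, anchored at both ends of the line.
    private static final Pattern ROW =
            Pattern.compile("^(\\d+),\\s*(.+),\\s*(\\d+),\\s*(\\d+)$");

    // Returns the four fields, with grouping commas removed from the second.
    static String[] parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected row layout: " + line);
        }
        return new String[] {
            m.group(1),
            m.group(2).replace(",", ""), // e.g. 10,000 becomes 10000
            m.group(3),
            m.group(4)
        };
    }

    public static void main(String[] args) {
        // prints 10|10000|100|1000
        System.out.println(String.join("|", parse("10,10,000,100,1000")));
    }
}
```

This only works when the column layout is known in advance; a row with a different number of columns is rejected rather than guessed at.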
Upvotes: 0