HP.

Reputation: 19896

Parse CSV with double quote in some cases

I have a CSV file in the following format:

a1, a2, a3, "a4,a5", a6

Only fields that contain a comma are quoted.

Using Java, how can I parse this easily? I am trying to avoid open source CSV parsers because of company policy. Thanks.

Upvotes: 28

Views: 38759

Answers (6)

sandeep

Reputation: 1

public static void main(String[] args) {
    
    final StringBuilder sb = new StringBuilder(240000);
    String s = "";
    boolean start = false;
    boolean ending = false;
    boolean nestedQuote = false;
    boolean nestedComma = false;
    
    char previous = 0 ;
    
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // Quote directly after a comma: a quoted field starts.
        if (!start && c == '"' && previous == ',' && !nestedQuote) {
            System.out.println("started");
            sb.append(c);
            start = true;
            previous = c;
            System.out.println(sb);
            continue;
        }

        // Comma directly after a quote: the quoted field ends.
        if (start && c == ',' && previous == '"') {
            nestedQuote = false;
            System.out.println("ended");
            sb.append(c);
            previous = c;
            System.out.println(sb);
            start = false;
            ending = true;
            continue;
        }

        // Comma inside a quoted field: write ';' in its place.
        if (start && c == ',' && previous != '"' && !nestedQuote) {
            previous = c;
            sb.append(';');
            continue;
        }

        if (start && ending && c == '"') {
            nestedQuote = true;
            sb.append(c);
            previous = c;
            continue;
        }

        if (start && c == '"' && nestedQuote) {
            nestedQuote = false;
            previous = c;
            continue;
        }

        if (start && c == ',' && nestedQuote) {
            nestedComma = true;
            sb.append(';');
            previous = c;
            continue;
        }

        if (start && c == ',' && nestedQuote && nestedComma) {
            nestedComma = false;
            previous = c;
            continue;
        }

        // Any other character is copied through unchanged.
        sb.append(c);
        previous = c;
    }
    System.out.println(sb.toString().replaceAll("\"", ""));
}

Upvotes: 0

Marco Guardigli

Reputation: 1

Here is my solution, in Python. It can take care of single level quotes.

def parserow(line):
    ''' this splits the input line on commas ',' but allowing commas within fields
    if they are within double quotes '"'
    example:
        fieldname1,fieldname2,fieldname3
        field value1,"field, value2, allowing, commas", field value3
    gives:
        ['field value1','"field, value2, allowing, commas"', ' field value3']
    '''
    out = []
    current_field = ''
    within_quote = False
    for c in line:
        if c == '"':
            within_quote = not within_quote
        if c == ',':
            if not within_quote:
                out.append(current_field)
                current_field = ''
                continue
        current_field += c
    # always append the last field, so a trailing comma yields an empty final field
    out.append(current_field)
    return out
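
Since the question asks for Java, here is a rough Java port of the same quote-toggling loop. The class and method names are made up for illustration. Like the Python version, it keeps the quote characters and any leading spaces in the fields, and it does not handle escaped (doubled) quotes. It always appends the final field, so a trailing comma produces an empty last field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; a sketch of the quote-toggling approach, not a full CSV parser.
class QuoteToggleSplitter {

    static List<String> parseRow(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean withinQuote = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                withinQuote = !withinQuote;   // entering or leaving a quoted section
            }
            if (c == ',' && !withinQuote) {
                out.add(current.toString());  // an unquoted comma ends the field
                current.setLength(0);
                continue;
            }
            current.append(c);                // quotes and spaces are kept as-is
        }
        out.add(current.toString());          // last field (may be empty)
        return out;
    }

    public static void main(String[] args) {
        // Sample line taken from the question.
        System.out.println(parseRow("a1, a2, a3, \"a4,a5\", a6"));
    }
}
```

Call trim() on each field and strip the surrounding quotes afterwards if you want clean values.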

Upvotes: 0

craigrs84

Reputation: 3064

The below code seems to work well and can handle quotes within quotes.

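import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final static Pattern quote = Pattern.compile("^\\s*\"((?:[^\"]|(?:\"\"))*?)\"\\s*,");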

public static List<String> parseCsv(String line) throws Exception
{       
    List<String> list = new ArrayList<String>();
    line += ",";

    for (int x = 0; x < line.length(); x++)
    {
        String s = line.substring(x);
        if (s.trim().startsWith("\""))
        {
            Matcher m = quote.matcher(s);
            if (!m.find())
                throw new Exception("CSV is malformed");
            list.add(m.group(1).replace("\"\"", "\""));
            x += m.end() - 1;
        }
        else
        {
            int y = s.indexOf(",");
            if (y == -1)
                throw new Exception("CSV is malformed");
            list.add(s.substring(0, y));
            x += y;
        }
    }
    return list;
}

Upvotes: 1

bestsss

Reputation: 12066

Here is some code for you. I hope using code from here doesn't count as open source, which it is.

package bestsss.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SplitCSVLine {
    public static String[] splitCSV(BufferedReader reader) throws IOException{
        return splitCSV(reader, null, ',', '"');
    }

    /**
     * 
     * @param reader - a line-based reader (we are lazy and read whole lines)
     * @param expectedColumns - optional int[1]: passes in the expected column count and receives the actual one
     * @param separator - the C(omma) SV (or alternative like semi-colon) 
     * @param quote - double quote char ('"') or alternative
     * @return String[] containing the fields
     * @throws IOException
     */
    public static String[] splitCSV(BufferedReader reader, int[] expectedColumns, char separator, char quote) throws IOException{       
        final List<String> tokens = new ArrayList<String>(expectedColumns==null?8:expectedColumns[0]);
        final StringBuilder sb = new StringBuilder(24);

        for(boolean quoted=false;;sb.append('\n')) {//lazy, we do not preserve the original new line, but meh
            final String line = reader.readLine();
            if (line==null)
                break;
            for (int i = 0, len= line.length(); i < len; i++) { 
                final char c = line.charAt(i);
                if (c == quote) {
                    if( quoted   && i<len-1 && line.charAt(i+1) == quote ){//2xdouble quote in quoted 
                        sb.append(c);
                        i++;//skip it
                    }else{
                        if (quoted){
                            //next symbol must be either separator or eol according to RFC 4180
                            if (i==len-1 || line.charAt(i+1) == separator){
                                quoted = false;
                                continue;
                            }
                        } else{//not quoted
                            if (sb.length()==0){//at the very start
                                quoted=true;
                                continue;
                            }
                        }
                        //if fall here, bogus, just add the quote and move on; or throw exception if you like to
                        /*
                        5.  Each field may or may not be enclosed in double quotes (however
                           some programs, such as Microsoft Excel, do not use double quotes
                           at all).  If fields are not enclosed with double quotes, then
                           double quotes may not appear inside the fields.
                      */ 
                        sb.append(c);                   
                    }
                } else if (c == separator && !quoted) {
                    tokens.add(sb.toString());
                    sb.setLength(0); 
                } else {
                    sb.append(c);
                }
            }
            if (!quoted)
                break;      
        }
        tokens.add(sb.toString());//add last
        if (expectedColumns !=null)
            expectedColumns[0] = tokens.size();
        return tokens.toArray(new String[tokens.size()]);
    }
    public static void main(String[] args) throws Throwable{
        java.io.StringReader r = new java.io.StringReader("222,\"\"\"zzzz\", abc\"\" ,   111   ,\"1\n2\n3\n\"");
        System.out.println(java.util.Arrays.toString(splitCSV(new BufferedReader(r))));
    }
}

Upvotes: 3

K4KYA

Reputation: 147

I came across this same problem (but in Python). One way I found to solve it without regexes: when you get the line, check for quotes. If there are any, split the string on the quote character, then split the even-indexed pieces of the resulting array on commas. The odd-indexed pieces are the complete quoted values.

I'm no Java coder, so take this as pseudocode...

line = [];
if ('"' in row) {
    vals = row.split('"');
    for (int i = 0; i < vals.length; i++) {
        if (i % 2 == 0) {
            line += vals[i].split(',');   // outside quotes: split on commas
        } else {
            line += vals[i];              // inside quotes: keep the whole value
        }
    }
}
else {
    line = row.split(',');
}

Alternatively, use a regex.
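
For anyone who wants the split-on-quotes idea in actual Java, here is a minimal sketch. The class and method names are made up. It assumes quotes are balanced, there are no escaped quotes, and blank pieces next to a quoted value are just separator leftovers. That last assumption means genuinely empty unquoted fields are dropped, so treat it as an illustration rather than a complete parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; a sketch of the split-on-quotes idea, not a full CSV parser.
class QuoteSplitParser {

    static List<String> parseRow(String row) {
        List<String> fields = new ArrayList<>();
        // Limit -1 keeps trailing empty pieces, so odd indexes are always the quoted parts.
        String[] pieces = row.split("\"", -1);
        for (int i = 0; i < pieces.length; i++) {
            if (i % 2 == 1) {
                fields.add(pieces[i]);        // inside quotes: keep whole, commas included
                continue;
            }
            for (String token : pieces[i].split(",")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {     // assumption: blank tokens are separator leftovers
                    fields.add(trimmed);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample line taken from the question.
        System.out.println(parseRow("a1, a2, a3, \"a4,a5\", a6"));
    }
}
```

Walking the pieces in a single loop keeps the fields in their original order, which matters as soon as a quoted value is not the last column.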

Upvotes: 4

Mark Byers

Reputation: 838974

You could use Matcher.find with the following regular expression:

\s*("[^"]*"|[^,]*)\s*

Here's a more complete example:

String s = "a1, a2, a3, \"a4,a5\", a6";
Pattern pattern = Pattern.compile("\\s*(\"[^\"]*\"|[^,]*)\\s*");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

See it working online: ideone
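
One caveat, as far as I can tell: every part of that pattern can match the empty string, so find() also reports an empty match at each comma and at the end of the input, and the loop prints blank lines between the fields. The sketch below is one possible variant, not the original answer's code: it anchors each match with \G, consumes the separator as a second group, and stops when that group matches the end of the line. The class and method names are made up, and it assumes the line has no trailing line terminator and no escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names; a variant of the pattern above that also consumes the separator.
class RegexCsvSplitter {

    // group(1): a quoted run or a lazily matched unquoted run
    // group(2): "," between fields, or "" when the end of the line was reached
    private static final Pattern FIELD =
            Pattern.compile("\\G\\s*(\"[^\"]*\"|[^,]*?)\\s*(,|$)");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.add(m.group(1));       // quotes are kept, as in the original pattern
            if (m.group(2).isEmpty()) {
                break;                    // end of line: stop before an extra empty match
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample line taken from the question.
        for (String field : split("a1, a2, a3, \"a4,a5\", a6")) {
            System.out.println(field);
        }
    }
}
```

Because the separator is part of each match, an empty field between two commas or after a trailing comma comes back as an empty string instead of being confused with a spurious match.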

Upvotes: 26
