user486174
user486174

Reputation: 41

Manipulating CSV file with a non standardized content

I have a CSV file with a non standardized content, it goes something like this:

John, 001
01/01/2015, hamburger
02/01/2015, pizza
03/01/2015, ice cream
Mary, 002
01/01/2015, hamburger
02/01/2015, pizza
John, 003
04/01/2015, chocolate

Now, what I'm trying to do is to write a logic in java to separate them.I would like "John, 001" as the header and to put all the rows under John, before Mary to be John's.

Will this be possible? Or should I just do it manually?

Edit:
For the input, even though it is not standardized, a noticeable pattern is that the row that do not have names will always starts with a date.
My output goal would be a java object, where I can store it in the database eventually in the format below.

Name, hamburger, pizza, ice cream, chocolate
John, 01/01/2015, 02/01/2015, 03/01/2015, NA
Mary, 01/01/2015, 02/01/2015, NA, NA
John, NA, NA, NA, 04/01/2015

Upvotes: 0

Views: 169

Answers (3)

Jeronimo Backes
Jeronimo Backes

Reputation: 6289

Use uniVocity-parsers to handle this for you. It comes with a master-detail row processor.

// 1st, Create a RowProcessor to process all "detail" elements (dates/ingredients)
ObjectRowListProcessor detailProcessor = new ObjectRowListProcessor();

// 2nd, Create MasterDetailProcessor to identify whether or not a row is the master row (first value of the row is a name, second is an integer).
MasterDetailListProcessor masterRowProcessor = new MasterDetailListProcessor(RowPlacement.TOP, detailProcessor) {
    @Override
    protected boolean isMasterRecord(String[] row, ParsingContext context) {
         try{
             //tries to convert the second value of the row to an Integer.
             Integer.parseInt(String.valueOf(row[1]));
             return true;
         } catch(NumberFormatException ex){
             return false;
         }
    }
};

CsvParserSettings parserSettings = new CsvParserSettings();

// Set the RowProcessor to the masterRowProcessor.
parserSettings.setRowProcessor(masterRowProcessor);

CsvParser parser = new CsvParser(parserSettings);
parser.parse(new FileReader(yourFile));

// Here we get the MasterDetailRecord elements.
List<MasterDetailRecord> rows = masterRowProcessor.getRecords();

// Each master record has one master row and multiple detail rows.
MasterDetailRecord masterRecord = rows.get(0);
Object[] masterRow = masterRecord.getMasterRow();
List<Object[]> detailRows = masterRecord.getDetailRows();

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Upvotes: 0

Serge Ballesta
Serge Ballesta

Reputation: 148880

If I have correctly understood, the specs are :

  • input is text, one record per line (fields are comma delimited)
  • 2 kinds of records :
    • headers consisting of a name and a number (number is ignored)
    • actual records consisting of a date and a meal
  • output should contain :
    • one header containing the constant Name, and the meals in order of occurence
    • on record per name consisting with the name and the dates corresponding to the meals - an absent field will have NA constant string
  • we assume that we will never get for a name the same date for different input records.

The algorithm is in pseudo code :

Data structures :
 one list of struct< string name, hash< int meal index, date> > for the names : base
 one list of strings for the meals : meals

Code :

name = null
iname = -1
Loop per input lines {
  if first field is date {
    if name == null {
      throw Exception("incorrect structure");
    }
    meal = second field
    look for index of meal in meals
    if not found {
      index = len(meals);
      add meal at end of list meals
    }
    base[iname].hash[index] = date
  }
  else {
    name = first field
    iname += 1
    add a new struc { name, empty hash } at end of list base
  }
}
close input file
open output file
// headers
print "names"
for meal in meals {
  print ",", meal
}
print newline
for (i=0; i<=iname; i++) {
  print base[i].name
  for meal in meals {
    look for meal in base[i].hash.keys
    if found {
      print ",", base[i].hash[meal]
    }
    else {
      print ",NA"
    }
  }
  print newline
}
close output file

Just code it in correct Java and come back here if you have any problem.

Upvotes: 2

Timo Hanisch
Timo Hanisch

Reputation: 224

You could just read the file into a list

List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);

Afterwards iterate over the list and split them for wanted delimiters (",").

Now you could just use if-else or switch blocks to check for specific entries.

List<DataObject> objects = new ArrayList<>();
DataObject dataObject = null;
for(String s : lines) {
    String [] splitLine = s.split(",");
    if(splitLine[0].matches("(\d{2}\/){2}\d{4}")) {
        // We found a data
        if(dataObject != null && splitLine.length == 2) {
            String date = splitLine[0];
            String dish = splitLine[1];
            dataObject.add(date, dish);
        } else {
            // Handle error
        }
    } else if(splitLine.length == 2) {
        // We can create a new data object
        if(dataObject != null) {
            objects.add(dataObject);
        }
        String name = splitLine[0];
        String id = splitLine[1];
        dataObject = new DataObject(name, id);
    } else {
        // Handle error
    }
}

Now you can sort them into your specific categories.

Edit: Changed the loop and added a regex (which may not be optimal) for matching the date strings and using them to decide whether to add them to the last data object.

The DataObject class can contain data structures holding the dates/dishes. When the CSV is parsed you can iterate over the objects List and do whatever you want. I hope this answer helps :)

Upvotes: 2

Related Questions