Trayton White
Trayton White

Reputation: 1

How do I transform CSV file where desired headers are "in line" with the data?

I have a CSV file that is arranged like below:

Year1
Award1,Winner1,Winner2,Winner3...
Award2,Winner4,Winner5,Winner6...
...
Year2
Award1,Winner7,Winner8,Winner9...

How could I rearrange this data into the below format, where the first row is the header?

Year,AwardType,Winner
Year1,Award1,Winner1
Year1,Award1,Winner2
...
Year1,Award2,Winner6
...
Year2,Award1,Winner7
...

I plan to do some relatively simple analyses in R, and I think the desired layout would make the data easier to work with. If that's not the case, I'm open to other suggestions.

Thank you

Upvotes: 0

Views: 147

Answers (3)

Appak
Appak

Reputation: 492

Well if you don't mind using java you could use something like this

javac Converter.java
java Converter > newdata.csv


With this being the code in Converter.java

public class Converter {

    public static void main(String[] args) throws FileNotFoundException {
        File file = new File("data.csv");
        Scanner scan = new Scanner(file);
        String year = null;
        System.out.println("Year,AwardType,Winner");
        while(scan.hasNext()) {
            String line = scan.nextLine();
            if (line.length() == 4) {
                year = line;
            } else {
                String[] awardPlusWinners = line.split(",");

                for ( int i = 1; i < awardPlusWinners.length; i++) {
                    System.out.println(year + "," + awardPlusWinners[0] + "," + awardPlusWinners[i]);
                }
            }
        }
    }
}

Upvotes: -1

Brian Diggs
Brian Diggs

Reputation: 58875

Here is my R solution. Start with some mock data. Your real example will be in a file.

mockfile <- 
"Year1
Award1,Winner1,Winner2,Winner3
Award2,Winner4,Winner5,Winner6
Award3,Winner7,Winner8,Winner9
Year2
Award1,Winner7,Winner8,Winner9
Award2,Winner12,Winner13,Winner14
Award3,Winner15,Winner16,Winner17"

textConnection(mockfile) in the rest would be replace with the filename in your case

entries <- count.fields(textConnection(mockfile), sep=",")
blockstart <- which(entries==1)
blocklength <- diff(c(blockstart, length(entries)+1))-1

Find the lines with just one thing on them, that is the start of a block. Also find the length of a block. If all your blocks are the same, these steps can be considerably simplified.

con <- textConnection(mockfile)
# get to first single line
readLines(con, n=blocstart[1]-1)
blocks <- list()
# iterate over blocks
for (i in seq_along(blockstart)) {
    # read the single line; that is the year
    Year <- readLines(con, n=1L)
    # feed the block part to read.csv
    rest <- read.csv(text=readLines(con, blocklength[i]), header=FALSE)
    rest$Year <- Year
    blocks[[i]] <- rest
}
# bind all the blocks together
full <- do.call(rbind, blocks)
# rename the award column
names(full)[1] <- "AwardType"

This gives a data frame that looks like

> full
  AwardType       V2       V3       V4  Year
1    Award1  Winner1  Winner2  Winner3 Year1
2    Award2  Winner4  Winner5  Winner6 Year1
3    Award3  Winner7  Winner8  Winner9 Year1
4    Award1  Winner7  Winner8  Winner9 Year2
5    Award2 Winner12 Winner13 Winner14 Year2
6    Award3 Winner15 Winner16 Winner17 Year2

To reshape it the way you want, I find the reshape2 package easiest.

library("reshape2")

melt(full, id.vars=c("Year","AwardType"))

which gives

> melt(full, id.vars=c("Year","AwardType"))
    Year AwardType variable    value
1  Year1    Award1       V2  Winner1
2  Year1    Award2       V2  Winner4
3  Year1    Award3       V2  Winner7
4  Year2    Award1       V2  Winner7
5  Year2    Award2       V2 Winner12
6  Year2    Award3       V2 Winner15
7  Year1    Award1       V3  Winner2
8  Year1    Award2       V3  Winner5
9  Year1    Award3       V3  Winner8
10 Year2    Award1       V3  Winner8
11 Year2    Award2       V3 Winner13
12 Year2    Award3       V3 Winner16
13 Year1    Award1       V4  Winner3
14 Year1    Award2       V4  Winner6
15 Year1    Award3       V4  Winner9
16 Year2    Award1       V4  Winner9
17 Year2    Award2       V4 Winner14
18 Year2    Award3       V4 Winner17

You can delete the variable column if you really don't want it.

Upvotes: 3

Vincent Zoonekynd
Vincent Zoonekynd

Reputation: 32401

Here is a solution in R.

d <- read.table("tmp.csv")$V1
result <- list()
year <- "Unknown"
for( line in d ) {
  if( grepl(",", line) ) {
    line <- strsplit(line, ",")[[1]]
    line <- data.frame( year = year, award=line[1], winner=line[-1] )
    result <- append( result, list(line) )
  } else {
    year <- line
  }
}
result <- do.call(rbind, result)

Upvotes: 1

Related Questions