Reputation: 1
I have a CSV file that is arranged like below:
Year1
Award1,Winner1,Winner2,Winner3...
Award2,Winner4,Winner5,Winner6...
...
Year2
Award1,Winner7,Winner8,Winner9...
How could I rearrange this data into the below format, where the first row is the header?
Year,AwardType,Winner
Year1,Award1,Winner1
Year1,Award1,Winner2
...
Year1,Award2,Winner6
...
Year2,Award1,Winner7
...
I plan to do some relatively simple analyses in R, and I think the desired layout would make the data easier to work with. If that's not the case, I'm open to other suggestions.
Thank you
Upvotes: 0
Views: 147
Reputation: 492
Well if you don't mind using java you could use something like this
javac Converter.java
java Converter > newdata.csv
With this being the code in Converter.java
public class Converter {
public static void main(String[] args) throws FileNotFoundException {
File file = new File("data.csv");
Scanner scan = new Scanner(file);
String year = null;
System.out.println("Year,AwardType,Winner");
while(scan.hasNext()) {
String line = scan.nextLine();
if (line.length() == 4) {
year = line;
} else {
String[] awardPlusWinners = line.split(",");
for ( int i = 1; i < awardPlusWinners.length; i++) {
System.out.println(year + "," + awardPlusWinners[0] + "," + awardPlusWinners[i]);
}
}
}
}
}
Upvotes: -1
Reputation: 58875
Here is my R solution. Start with some mock data. Your real example will be in a file.
mockfile <-
"Year1
Award1,Winner1,Winner2,Winner3
Award2,Winner4,Winner5,Winner6
Award3,Winner7,Winner8,Winner9
Year2
Award1,Winner7,Winner8,Winner9
Award2,Winner12,Winner13,Winner14
Award3,Winner15,Winner16,Winner17"
textConnection(mockfile)
in the rest would be replace with the filename in your case
entries <- count.fields(textConnection(mockfile), sep=",")
blockstart <- which(entries==1)
blocklength <- diff(c(blockstart, length(entries)+1))-1
Find the lines with just one thing on them, that is the start of a block. Also find the length of a block. If all your blocks are the same, these steps can be considerably simplified.
con <- textConnection(mockfile)
# get to first single line
readLines(con, n=blocstart[1]-1)
blocks <- list()
# iterate over blocks
for (i in seq_along(blockstart)) {
# read the single line; that is the year
Year <- readLines(con, n=1L)
# feed the block part to read.csv
rest <- read.csv(text=readLines(con, blocklength[i]), header=FALSE)
rest$Year <- Year
blocks[[i]] <- rest
}
# bind all the blocks together
full <- do.call(rbind, blocks)
# rename the award column
names(full)[1] <- "AwardType"
This gives a data frame that looks like
> full
AwardType V2 V3 V4 Year
1 Award1 Winner1 Winner2 Winner3 Year1
2 Award2 Winner4 Winner5 Winner6 Year1
3 Award3 Winner7 Winner8 Winner9 Year1
4 Award1 Winner7 Winner8 Winner9 Year2
5 Award2 Winner12 Winner13 Winner14 Year2
6 Award3 Winner15 Winner16 Winner17 Year2
To reshape it the way you want, I find the reshape2
package easiest.
library("reshape2")
melt(full, id.vars=c("Year","AwardType"))
which gives
> melt(full, id.vars=c("Year","AwardType"))
Year AwardType variable value
1 Year1 Award1 V2 Winner1
2 Year1 Award2 V2 Winner4
3 Year1 Award3 V2 Winner7
4 Year2 Award1 V2 Winner7
5 Year2 Award2 V2 Winner12
6 Year2 Award3 V2 Winner15
7 Year1 Award1 V3 Winner2
8 Year1 Award2 V3 Winner5
9 Year1 Award3 V3 Winner8
10 Year2 Award1 V3 Winner8
11 Year2 Award2 V3 Winner13
12 Year2 Award3 V3 Winner16
13 Year1 Award1 V4 Winner3
14 Year1 Award2 V4 Winner6
15 Year1 Award3 V4 Winner9
16 Year2 Award1 V4 Winner9
17 Year2 Award2 V4 Winner14
18 Year2 Award3 V4 Winner17
You can delete the variable
column if you really don't want it.
Upvotes: 3
Reputation: 32401
Here is a solution in R.
d <- read.table("tmp.csv")$V1
result <- list()
year <- "Unknown"
for( line in d ) {
if( grepl(",", line) ) {
line <- strsplit(line, ",")[[1]]
line <- data.frame( year = year, award=line[1], winner=line[-1] )
result <- append( result, list(line) )
} else {
year <- line
}
}
result <- do.call(rbind, result)
Upvotes: 1