Niek de Klein

Reputation: 8824

How to format data in (a) CSV file(s) so that it can easily be imported in R?

Edit:

So, this format would work:

featureID    charge    xcoordinate    ycoordinate
1            2         5105.9217      336.125209180674
1            2         5108.7642      336.124751115092
2            0         2434.9217      145.893331325278

But what if I have two columns with multiple values that are linked? Say the quality column has a machine and a quality linked, and the column looks like this:

 MachineQuality
 [[{1:1224}, {2:3453}], [{1:2242}, {2:4142}]]

Now, if I want to split that up like I did with the coordinates of the convexhull, I would need 2 rows instead of 1. But wouldn't I need 2 rows for every row that is already there (so 4, because there are already 2 for the coordinates), like this:

featureID    charge    xcoordinate    ycoordinate         quality1    quality2
1            2         5105.9217      336.125209180674    1224        3453
1            2         5105.9217      336.125209180674    2242        4142
1            2         5108.7642      336.124751115092    1224        3453
1            2         5108.7642      336.124751115092    2242        4142
[...]

Would it have to be like this?


I'm very new to R, my knowledge doesn't go much further than knowing how to make a vector and some simple plots. I'm going to use R for an internship project the next couple of months and during this time I will (hopefully) learn some of the ins and outs of R. However, before I start I need to produce the data that I'm going to do the statistics on. I need to know beforehand how I should format my output CSV data so that I can easily read it in once I start my R analysis.

One thing that I've been asked to do is make a CSV file out of the data so that it can be read in by R. The example CSV files for importing with R that I've seen all look like this:

featureID    Charge    value
1            2         10
2            0         9

However, my data mostly consists of columns whose cells contain multiple values. To clarify: my data consists of "features" that, among other information, have a "convexhull". This convexhull consists of paired x and y coordinates. So my data could look like this (only two coordinates shown; there can be many):

featureID    Charge    Convexhull
1            2         [[{'y': '336.125209180674'}, {'x': '5105.9217'}], [{'y': '336.124751115092'}, {'x': '5108.7642'}]]

Is it possible to get this into one CSV file and read it into R correctly (so that the paired x and y coordinates are preserved)? If so, what should the CSV file look like? For example, I've seen CSV files with multiple values that look like this:

featureID    charge    xcoordinate    ycoordinate
1            2         5105.9217      336.125209180674
                       5108.7642      336.124751115092
2            0         2434.9217      145.893331325278

But I can't find out whether this is easily imported by R.

If this is not doable in one CSV file, can separate CSV files easily be imported independently and then linked with a primary key, as in a database?

Upvotes: 2

Views: 1254

Answers (2)

G. Grothendieck

Reputation: 269556

long vs. wide form. Your last example is known as long form (except all cells should be filled in) and your first example is roughly wide form as discussed on the ?reshape page and illustrated in the examples at the end of that page. You likely want to stick with long form. For an alternative see the reshape2 package.
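
To make the long form concrete on the producing side, here is a minimal sketch. It assumes the data is generated in Python (the nested literal in the question looks like Python output, but that is a guess) and that each feature is held as a dict shaped like the hypothetical `features` list below. The nested shape for feature 2 is also an assumption; only its coordinate values come from the question. It writes one fully filled-in row per coordinate pair:

```python
import csv
import io

# Hypothetical in-memory records, shaped like the example in the question.
features = [
    {
        "featureID": 1,
        "charge": 2,
        "convexhull": [
            [{"y": "336.125209180674"}, {"x": "5105.9217"}],
            [{"y": "336.124751115092"}, {"x": "5108.7642"}],
        ],
    },
    {
        "featureID": 2,
        "charge": 0,
        "convexhull": [[{"y": "145.893331325278"}, {"x": "2434.9217"}]],
    },
]


def to_long_rows(records):
    """Yield one row per (feature, point) pair, with every cell filled in."""
    for rec in records:
        for point in rec["convexhull"]:
            # Each point is a list of single-key dicts; merge them into one dict.
            coords = {}
            for part in point:
                coords.update(part)
            yield [rec["featureID"], rec["charge"], coords["x"], coords["y"]]


def write_long_csv(records, handle):
    """Write a header line followed by the long-form rows."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["featureID", "charge", "xcoordinate", "ycoordinate"])
    writer.writerows(to_long_rows(records))


buffer = io.StringIO()
write_long_csv(features, buffer)
long_csv = buffer.getvalue()
print(long_csv)
```

Because featureID and charge are repeated on every row, the x and y values stay paired and no cell is left blank.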

save & load. Note that if you are only writing it out to read it back in to R later (as opposed to communicating it to some other software) you could use save and load which don't require any change to the object at all.

json. Another possibility, given the form of your example, is the rjson package.
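
As a rough illustration of the JSON route, the sketch below assumes the producing program is written in Python and uses a hypothetical record layout (a list of x/y objects per feature, which is not the exact nesting shown in the question). The nested structure is written out unchanged, so nothing has to be flattened:

```python
import json

# Hypothetical record layout; the field names follow the question's example.
feature = {
    "featureID": 1,
    "charge": 2,
    "convexhull": [
        {"x": 5105.9217, "y": 336.125209180674},
        {"x": 5108.7642, "y": 336.124751115092},
    ],
}

# Serialise a list of features; each x stays paired with its y.
json_text = json.dumps([feature], indent=2)

# Reading the text back confirms that nothing was lost on the way out.
round_tripped = json.loads(json_text)
print(json_text)
```

On the R side, the rjson package provides fromJSON for reading such text. The result is a nested list rather than a data.frame, so some reshaping would probably still be needed.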

Upvotes: 2

John

Reputation: 23758

The only critical things are that you have a unique character separating your data columns and that each column is the same length. As long as the second row in your last example is filled in that will import fine.

You need to consider what you want to do with the data after it's in R to decide how you might want any other special formatting beforehand. But, as long as the column separator is a unique character and the columns are of equal length then it will import.
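
If you want to check the two requirements before handing the file over, something like the following sketch could work. It assumes a comma separator and a header line, and `check_rows` is a hypothetical helper name, not part of any library. It flags rows with the wrong number of fields and rows with empty cells:

```python
import csv
import io


def check_rows(csv_text, delimiter=","):
    """Return (line_number, problem) pairs for rows that are not rectangular."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append((line_no, "wrong number of fields"))
        elif any(cell.strip() == "" for cell in row):
            problems.append((line_no, "empty cell"))
    return problems


good = (
    "featureID,charge,xcoordinate,ycoordinate\n"
    "1,2,5105.9217,336.125209180674\n"
    "1,2,5108.7642,336.124751115092\n"
)
# Same data, but with the blank cells from the question and one short row.
bad = (
    "featureID,charge,xcoordinate,ycoordinate\n"
    "1,2,5105.9217,336.125209180674\n"
    ",,5108.7642,336.124751115092\n"
    "2,0,2434.9217\n"
)

good_problems = check_rows(good)
bad_problems = check_rows(bad)
print(good_problems)
print(bad_problems)
```

An empty cell is not strictly an import error, but it would lose the link between the coordinates and their feature, which is why the sketch reports it.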

(You can violate the unique separator requirement if your entries are wrapped in quotes. And if you want to get really fancy you could "import" almost anything. But if someone's asking you to format the data then they probably want a rectangular, data.frame-compatible layout. They probably want unique values in each column (no columns of points). But that's between you and them.)
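
To illustrate the quoting exception only (not as a recommended layout), here is a small sketch that assumes the file is written with Python's csv module. The convexhull cell is a made-up text encoding of the two points from the question. Since that cell contains commas, the writer wraps it in double quotes and it still reads back as one field:

```python
import csv
import io

# Hypothetical encoding: the points are kept as text in a single cell.
row = [1, 2, "5105.9217,336.125209180674;5108.7642,336.124751115092"]

buffer = io.StringIO()
# QUOTE_MINIMAL quotes only the fields that contain the separator or a quote.
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["featureID", "charge", "convexhull"])
writer.writerow(row)
quoted_csv = buffer.getvalue()

# Parsing the text again yields three fields per row, not five.
parsed = list(csv.reader(io.StringIO(quoted_csv)))
print(quoted_csv)
```

The quoted cell would arrive in R as a single string, so the points would still have to be split apart there. That is the "column of points" layout cautioned against above.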

Upvotes: 2
