techvigil
techvigil

Reputation: 47

Excel - combining two csv files into one with a common column

I am working with two datasets in csv form (movielens latest-small dataset). Given below are the fields of both.

rating.csv

  user_id   movie_id  rating

movie.csv

  movie_id   movie_name

what I want is to combine them into a single .csv with following fields

user_id   movie_id   movie_name   rating

So that the common column movie_id maps with corresponding movie_name.

Could that be done using Excel? If not, how can I do it?

I just need it as a dataset for my recommender engine, so any simple solution is welcome as end result is all that matters. But since I've some experience in java so that would be easy for my easy understand and implement.

If there is some way using Excel then that would be the best. I have tried searching online and found some VLOOKUP method but couldn't clearly get it.

Also I tried some online merging tools but they just attached the sheets one after the another not mapping the column. So I have no problem using online tools too.

Upvotes: 0

Views: 7882

Answers (2)

Aurielle Perlmann
Aurielle Perlmann

Reputation: 5529

Check out this tool - https://github.com/DataFoxCo/gocsv - it's based off of csvkit but has a ton of additional features. One of our engineers custom built it - and open sourced it to help solve some of these data issues we deal with every day :)

It will do a vlookup essentially of any sized csv in merely seconds using the join command:

gocsv join --columns 'movie_id','movie_id' --left rating.csv movie.csv > combineddata.csv

then if you still want to reorder the columns, you can do that too:

gocsv select --columns 'user_id','movie_id','movie_name','rating' combineddata.csv > combineddata-final.csv

I split the commands out up top to help explain its use - the documentation has all the examples on it also but ultimately I would really recommend pipelining it and doing it in one command like this:

cat rating.csv \ | gocsv join --left --columns 'movie_id','movie_id' movie.csv \ | gocsv select --columns 'user_id','movie_id','movie_name','rating' > combineddata.csv

Upvotes: 0

Robin Mackenzie
Robin Mackenzie

Reputation: 19289

This is the method with a VLOOKUP formula within Excel:

enter image description here

The formula takes 4 arguments:

  1. The value you are wanting to look up
  2. The range of data you are looking into
  3. The column within (2) that contains the answer you want
  4. Whether to match on (1) approximately i.e. FALSE = exact match

See here for documentation on the function.

Upvotes: 1

Related Questions