tsouchlarakis

Reputation: 1619

Read IMDb Data into R

I need to read the data that IMDb makes publicly available through FTP here. The problem is that the data is not always in a consistent format. I have attached below a small snippet of the data (first several lines).

I've tried using read.table() with sep = '\t', but it does not split every line correctly.

Here you can find the sample data.

How can I read this table into R?

Upvotes: 0

Views: 571

Answers (1)

Martin Schmelzer

Reputation: 23909

Use plain readLines() and then strsplit() each line on one or more tabs (the regex \\t+).

# read the raw lines; "latin1" is the value readLines() recognises for Latin-1
file <- readLines("PATHTO/actorstest.txt", encoding = "latin1")

# delete empty rows
file <- file[!grepl('^\\s*$', file)]

# split into two columns on one or more tabs
file <- strsplit(x = file, split = '\\t+')

# row-bind all items and create the data frame
df   <- data.frame(do.call(rbind, file))
df

This results in:

                      X1                                                            X2
1            Aa, Brynjar                       Adjø solidaritet (1985)  [Pønker]  <40>
2               Aa, Henk      Cuby + Blizzards: 40 jaar de blues (2006) (V)  [Himself]
3       Aa, Henk van der "De slimste mens ter wereld" (2012) {(#5.10)}  [Himself]  <4>
4                        "De slimste mens ter wereld" (2012) {(#5.11)}  [Himself]  <3>
5                         "De slimste mens ter wereld" (2012) {(#5.8)}  [Himself]  <3>
6                         "De slimste mens ter wereld" (2012) {(#5.9)}  [Himself]  <4>
7       Aab, Vanessa (I)                               Frollein Frappé (2014)  [Greta]
8                                                      Nach einem Traum (2014)  [Elke]
9            Aabear, Jim                     Paradise Recovered (2010)  [Richard]  <8>
10                                                          Senses (2009)  [Mr. Cohen]
11      Aabed, Essam Abu                              Omar (2013)  [Omar's Boss]  <10>
12 Aabedlaoui, El Hassan                           La vache (2016)  [Aissaoui 2]  <80>
13                Aabeel                                     Czeski Friends (2004) (V)
14         Aabel, Anders                                          Kontakt! (1956)  <7>

Notice that some actors have several entries in column two; on those continuation rows, column one is empty. I leave capturing that to you.
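One possible way to handle the rows with an empty actor name is to carry the last non-empty name downward. The following is only a sketch: the small df below is a hypothetical stand-in for the data frame built above, and it assumes that the columns are character vectors and that the first row always has a name.

```r
# Hypothetical stand-in for the two-column data frame built above
df <- data.frame(
  X1 = c("Aa, Henk van der", "", "", "Aabear, Jim", ""),
  X2 = c('"De slimste mens ter wereld" (2012) {(#5.10)}  [Himself]  <4>',
         '"De slimste mens ter wereld" (2012) {(#5.11)}  [Himself]  <3>',
         '"De slimste mens ter wereld" (2012) {(#5.8)}  [Himself]  <3>',
         'Paradise Recovered (2010)  [Richard]  <8>',
         'Senses (2009)  [Mr. Cohen]'),
  stringsAsFactors = FALSE
)

# TRUE where a row carries its own actor name
has_name <- df$X1 != ""

# cumsum() counts the named rows seen so far, so it indexes the most
# recent name for every row (assumes the first row is not empty)
df$X1 <- df$X1[has_name][cumsum(has_name)]

df
```

After this step every row has its actor in X1, so the entries can be grouped or counted per actor, for example with table(df$X1).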

Upvotes: 1
