Reputation: 1619
I need to read the data that IMDb makes publicly available through FTP here. The problem is that the data is not always in a consistent format. I have attached below a small snippet of the data (first several lines).
I've tried using read.table() with sep = '\t', but it does not split every line correctly.
Here you can find the sample data.
How can I read this table into R?
Upvotes: 0
Views: 571
Reputation: 23909
Use plain readLines() and then strsplit() each line on \\t+ (one or more tabs).
# read the raw lines; "latin1" is the encoding name readLines() recognises
txt <- readLines("PATHTO/actorstest.txt", encoding = "latin1")
# drop empty or whitespace-only lines
txt <- txt[!grepl('^\\s*$', txt)]
# split each line into two fields on one or more tabs
fields <- strsplit(x = txt, split = '\\t+')
# row-bind all items and create a data frame
df <- data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
df
Which results in
   X1                    X2
1  Aa, Brynjar           Adjø solidaritet (1985) [Pønker] <40>
2  Aa, Henk              Cuby + Blizzards: 40 jaar de blues (2006) (V) [Himself]
3  Aa, Henk van der      "De slimste mens ter wereld" (2012) {(#5.10)} [Himself] <4>
4                        "De slimste mens ter wereld" (2012) {(#5.11)} [Himself] <3>
5                        "De slimste mens ter wereld" (2012) {(#5.8)} [Himself] <3>
6                        "De slimste mens ter wereld" (2012) {(#5.9)} [Himself] <4>
7  Aab, Vanessa (I)      Frollein Frappé (2014) [Greta]
8                        Nach einem Traum (2014) [Elke]
9  Aabear, Jim           Paradise Recovered (2010) [Richard] <8>
10                       Senses (2009) [Mr. Cohen]
11 Aabed, Essam          Abu Omar (2013) [Omar's Boss] <10>
12 Aabedlaoui, El Hassan La vache (2016) [Aissaoui 2] <80>
13 Aabeel                Czeski Friends (2004) (V)
14 Aabel, Anders         Kontakt! (1956) <7>
Notice that some actors have multiple entries in column two. I leave capturing that to you.
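One possible way to capture those multiple entries is to carry each actor's name down into the rows where X1 is empty. The following is only a sketch in base R. It builds a small stand-in data frame with values copied from the output above, and it assumes that the very first row always has a name and that continuation rows have an empty string in X1.

```r
# Small stand-in for the data frame built above (values copied from the sample output)
df <- data.frame(
  X1 = c("Aa, Henk van der", "", "", "Aabear, Jim", ""),
  X2 = c('"De slimste mens ter wereld" (2012) {(#5.10)} [Himself] <4>',
         '"De slimste mens ter wereld" (2012) {(#5.11)} [Himself] <3>',
         '"De slimste mens ter wereld" (2012) {(#5.8)} [Himself] <3>',
         "Paradise Recovered (2010) [Richard] <8>",
         "Senses (2009) [Mr. Cohen]"),
  stringsAsFactors = FALSE
)

# Rows that start a new actor have a non-empty name in X1
has_name <- df$X1 != ""

# cumsum() counts how many names have been seen so far, so every
# continuation row picks up the index of the most recent named row
# (assumption: the first row has a name, otherwise the index would be 0)
df$X1 <- df$X1[has_name][cumsum(has_name)]

# Optional: collect all entries per actor into a named list
entries_by_actor <- split(df$X2, df$X1)
```

After this step every row carries its actor's name, so the usual grouping tools (split(), aggregate(), and so on) can be applied to X2.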
Upvotes: 1