ccamara

Reputation: 1225

How to retrieve movies' genres from wikidata using R

I would like to retrieve information from Wikidata and store it in a data frame. To keep things simple, suppose I want to get the genre of each of the following movies and then keep only those that are sci-fi:

movies = c("Star Wars Episode IV: A New Hope", "Interstellar", 
       "Happythankyoumoreplease")

I know there is a package called WikidataR. If I am not mistaken, its vignettes describe two functions that may be useful: find_item and find_property, which retrieve a set of Wikidata items or properties whose aliases or descriptions match a particular search term. They looked like a good fit, so I thought of doing something like

info = list()
for (i in movies) {
  info[[i]] = find_item(i)  # keep every result instead of overwriting info
}

This is what I get from each item:

> find_item("Interstellar")

    Wikidata item search

Number of results:   10 

Results:
1    Interstellar (Q13417189) - 2014 US science fiction film 
2    Interstellar (Q6057099) 
3    interstellar medium (Q41872) - matter and fields (radiation) that exist in the space between the star systems in a galaxy;includes gas in ionic, atomic or molecular form, dust and cosmic rays. It fills interstellar space and blends smoothly into the surrounding intergalactic space 
4    space colonization (Q686876) - concept of permanent human habitation outside of Earth 
5    rogue planet (Q167910) - planetary-mass object that orbits the galaxy directly 
6    interstellar cloud (Q1054444) - accumulation of gas, plasma and dust in a galaxy 
7    interstellar travel (Q834826) - term used for hypothetical manned or unmanned travel between stars 
8    Interstellar Boundary Explorer (Q835898) 
9    starship (Q2003852) - spacecraft designed for interstellar travel 
10   interstellar object (Q2441216) - astronomical object in interstellar space, such as a comet 
> 

Unfortunately, the information that I get from find_item (see above) has two problems:

  1. It is not a data frame with all the Wikidata information for the item I am searching for, but a list of what seems to be metadata (Wikidata id, link...).
  2. It does not contain the information I need (the values of the Wikidata properties for each particular item).

Similarly, find_property provides metadata of a certain property. find_property("genre") retrieves the following information:

> find_property("genre")

    Wikidata property search

Number of results:   4 

Results:
1    genre (P136) - a creative work's genre or an artist's field of work (P101). Use main subject (P921) to relate creative works to their topic 
2    radio format (P415) - describes the overall content broadcast on a radio station 
3    sex or gender (P21) - sexual identity of subject: male (Q6581097), female (Q6581072), intersex (Q1097630), transgender female (Q1052281), transgender male (Q2449503). Animals: male animal (Q44148), female animal (Q43445). Groups of same gender use "subclass of" (P279) 
4    gender of a scientific name of a genus (P2433) - determines the correct form of some names of species and subdivisions of species, also subdivisions of a genus 

This has similar problems:

  1. it is not a dataframe
  2. it just stores metadata about the property
  3. I can't find any way to link each property to each element of the movies vector.

Is there any way to end up with a data frame containing the genres of those movies? (Or a data frame with all of Wikidata's information for them, which I would then manipulate to filter or select the data I want?)

Upvotes: 3

Views: 364

Answers (1)

DJJ

Reputation: 2539

These are just lists. You can inspect their structure with str(find_item("Interstellar")), for example.

Then you can go through each element of the list and pick out the fields you need. For example, to get the title and the label:

 a <- find_item("Interstellar")
 b <- Reduce(rbind,lapply(a, function(x) cbind(x$title,x$label)))
 data.frame(b)


##           X1                             X2
## 1  Q13417189                   Interstellar
## 2   Q6057099                   Interstellar
## 3     Q41872            interstellar medium
## 4    Q686876             space colonization
## 5    Q167910                   rogue planet
## 6   Q1054444             interstellar cloud
## 7    Q834826            interstellar travel
## 8    Q835898 Interstellar Boundary Explorer
## 9   Q2003852                       starship
## 10  Q2441216            interstellar object

This works easily when the data is regular. If some element is missing, you have to handle it yourself; for example, some items don't have a description. You can get around that with the following:

Reduce("rbind",lapply(a, 
                      function(x) cbind(x$title,
                                        x$label,
                                        ifelse(length(x$description)==0,NA,x$description))))
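
To get from the search results to the actual genres, one possible route is to fetch the full item with get_item and read its genre (P136) claims. The sketch below assumes the parsed JSON layout that WikidataR typically returns (claims$P136$mainsnak$datavalue$value$id for the genre ids, labels$en$value for the English label), that each search hit carries an id field, and that the first search hit is the film you mean. genre_ids, english_label and movie_genres are hypothetical helper names, not part of the package.

```r
# Sketch only: relies on the assumed entity layout described above.

# Q-ids of all genre (P136) claims of one parsed Wikidata entity,
# or NA if the entity has no genre claim.
genre_ids <- function(entity) {
  ids <- entity$claims$P136$mainsnak$datavalue$value$id
  if (is.null(ids)) NA_character_ else ids
}

# English label of one parsed entity, or NA if it has none.
english_label <- function(entity) {
  lab <- entity$labels$en$value
  if (is.null(lab)) NA_character_ else lab
}

# One row per (movie, genre). Needs the WikidataR package and network access.
# Assumption: the first search hit is the intended film.
movie_genres <- function(movies) {
  rows <- lapply(movies, function(m) {
    hits <- WikidataR::find_item(m)
    if (length(hits) == 0) {
      return(data.frame(movie = m, item = NA_character_,
                        genre = NA_character_, stringsAsFactors = FALSE))
    }
    item_id <- hits[[1]]$id
    ids <- genre_ids(WikidataR::get_item(item_id)[[1]])
    # Look up the English label of every genre item.
    genres <- vapply(ids, function(g) {
      if (is.na(g)) NA_character_ else english_label(WikidataR::get_item(g)[[1]])
    }, character(1), USE.NAMES = FALSE)
    data.frame(movie = m, item = item_id, genre = genres,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Usage (not run here because it queries Wikidata):
# genres <- movie_genres(movies)
# genres[grepl("science fiction", genres$genre), ]
```

Taking the first hit is fragile for ambiguous titles, so it is worth checking the item column of the result against the search output before trusting the genres.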

Upvotes: 2
