Reputation: 1599

Extract source metadata from downloaded file

I have a bunch of PDF files which I downloaded. Now I want to extract the download URL from each file's metadata. How do I do this programmatically? I prefer solutions in R, and I'm working on macOS Mojave.

If you want to reproduce you can [use this file].

Upvotes: 4

Answers (3)

Phung Dao

Reputation: 11

I know this question is a bit old, but since you're on a Mac there is an easier way with minimal coding, or no coding at all: if the source URL metadata shows up when you Cmd+I (Get Info) the file, then NameMangler can extract it.

Batch process the files to extract the URL metadata, format it however you wish or leave it as is, and set each file to be renamed with the URL metadata.

After the batch process completes, select all the files, copy them, and in Excel use Paste > Paste Special > Text to paste the names into a spreadsheet.

The same could technically be achieved using the built-in Automator utility.

Upvotes: 0

IRTFM

Reputation: 263342

I tried searching Ask Different for ways of emulating the choice of "Get Info" from a Terminal.app command line.

I found advice to use the command mdls and I get this from an R system-call:

system("mdls -name kMDItemWhereFroms ~/0.-miljoenennota.pdf")

#kMDItemWhereFroms = (
#   "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf",
#    ""
#)

To get that multi-line result into R as a character vector (rather than just having it printed at the console), add the intern=TRUE argument to the system call:

> res <- system("mdls -name kMDItemWhereFroms ~/0.-miljoenennota.pdf", intern=TRUE)
> res
[1] "kMDItemWhereFroms = ("
[2] "    \"https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf\","
[3] "    \"\""
[4] ")"
> res[2]
[1] "    \"https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf\","
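
Since res is just a character vector, the URL can be pulled out of it with a regular expression. This is a minimal sketch that assumes the output always has the layout shown above; parse_where_froms() is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: extract the non-empty quoted strings from the lines
# printed by `mdls -name kMDItemWhereFroms`.
parse_where_froms <- function(lines) {
  # Find every double-quoted chunk on every line
  quoted <- unlist(regmatches(lines, gregexpr('"[^"]*"', lines)))
  # Strip the surrounding quotes
  urls <- gsub('^"|"$', "", quoted)
  # Drop empty entries (the "" placeholder in the output above)
  urls[nzchar(urls)]
}

# The same lines as `res` above, typed in so the snippet runs anywhere
res <- c(
  "kMDItemWhereFroms = (",
  "    \"https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf\",",
  "    \"\"",
  ")"
)

parse_where_froms(res)
```

A line without any quoted string, such as a "(null)" value, yields character(0), so check the length of the result before indexing into it.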

To get all the attributes:

system("mdls ~/0.-miljoenennota.pdf")

#-----------
_kMDItemOwnerUserID            = 501
kMDItemAuthors                 = (
    "Tweede Kamer der Staten-Generaal"
)
kMDItemContentCreationDate     = 2018-10-08 23:45:35 +0000
kMDItemContentModificationDate = 2018-10-08 23:45:46 +0000
kMDItemContentType             = "com.adobe.pdf"
kMDItemContentTypeTree         = (
    "com.adobe.pdf",
    "public.data",
    "public.item",
    "public.composite-content",
    "public.content"
)
kMDItemCreator                 = "XPP"
kMDItemDateAdded               = 2018-10-08 23:45:46 +0000
kMDItemDisplayName             = "0.-miljoenennota.pdf"
kMDItemEncodingApplications    = (
    "Acrobat Distiller Server 8.1.0 (Pentium Linux, Built: 2007-09-07)"
)
kMDItemFSContentChangeDate     = 2018-10-08 23:45:46 +0000
kMDItemFSCreationDate          = 2018-10-08 23:45:35 +0000
kMDItemFSCreatorCode           = ""
kMDItemFSFinderFlags           = 0
kMDItemFSHasCustomIcon         = (null)
kMDItemFSInvisible             = 0
kMDItemFSIsExtensionHidden     = 0
kMDItemFSIsStationery          = (null)
kMDItemFSLabel                 = 0
kMDItemFSName                  = "0.-miljoenennota.pdf"
kMDItemFSNodeCount             = (null)
kMDItemFSOwnerGroupID          = 20
kMDItemFSOwnerUserID           = 501
kMDItemFSSize                  = 4004668
kMDItemFSTypeCode              = ""
kMDItemKind                    = "Portable Document Format (PDF)"
kMDItemLogicalSize             = 4004668
kMDItemNumberOfPages           = 196
kMDItemPageHeight              = 841.89
kMDItemPageWidth               = 595.276
kMDItemPhysicalSize            = 4005888
kMDItemSecurityMethod          = "None"
kMDItemVersion                 = "1.6"
kMDItemWhereFroms              = (
    "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf",
    ""
)
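
If you want those attributes in R rather than printed at the console, the single-line "key = value" entries are easy to split once the output has been captured with intern=TRUE. This is a rough sketch that assumes the layout shown above; parse_mdls_scalars() is a hypothetical helper, and multi-line array values such as kMDItemWhereFroms are deliberately skipped:

```r
# Hypothetical helper: turn single-line `key = value` entries of mdls output
# into a named character vector. Multi-line "( ... )" arrays are skipped.
parse_mdls_scalars <- function(lines) {
  pattern <- "^([^[:space:]]+)[[:space:]]+=[[:space:]]+(.*)$"
  m <- regmatches(lines, regexec(pattern, lines))
  # Keep matched lines whose value is not the opening "(" of an array
  m <- Filter(function(x) length(x) == 3 && x[3] != "(", m)
  # Remove the surrounding double quotes from string values
  vals <- vapply(m, function(x) gsub('^"|"$', "", x[3]), character(1))
  names(vals) <- vapply(m, `[`, character(1), 2)
  vals
}

# A few lines copied from the output above
mdls_out <- c(
  "kMDItemAuthors                 = (",
  "    \"Tweede Kamer der Staten-Generaal\"",
  ")",
  "kMDItemContentType             = \"com.adobe.pdf\"",
  "kMDItemFSHasCustomIcon         = (null)",
  "kMDItemNumberOfPages           = 196",
  "kMDItemVersion                 = \"1.6\""
)

attrs <- parse_mdls_scalars(mdls_out)
attrs[["kMDItemNumberOfPages"]]
```

All values come back as character strings, so numeric fields such as the page count need as.numeric() or as.integer() before use.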

I was also able to get what might be a different definition of "metadata" with:

install.packages("tabulizer", dependencies=TRUE)
tabulizer::extract_metadata("~/0.-miljoenennota.pdf")
#---------
$pages
[1] 196

$title
NULL

$author
[1] "Tweede Kamer der Staten-Generaal"

$subject
[1] ""

$keywords
[1] ""

$creator
[1] "XPP"

$producer
[1] "Acrobat Distiller Server 8.1.0 (Pentium Linux, Built: 2007-09-07)"

$created
[1] "Thu Sep 15 05:11:50 PDT 2016"

$modified
[1] "Thu Sep 15 05:34:06 PDT 2016"

$trapped
NULL

Upvotes: 1

hrbrmstr

Reputation: 78792

While you could have avoided the need for this by using R to programmatically download the PDFs, we can use the xattrs package to get to the data you seek:

library(xattrs) # https://gitlab.com/hrbrmstr/xattrs (not on CRAN)

Let's see what extended attributes are available for this file:

xattrs::list_xattrs("~/Downloads/forso/0.-miljoenennota.pdf")
## [1] "com.apple.metadata:kMDItemWhereFroms"
## [2] "com.apple.quarantine"

com.apple.metadata:kMDItemWhereFroms looks like a good target:

xattrs::get_xattr(
  path = "~/Downloads/forso/0.-miljoenennota.pdf",
  name = "com.apple.metadata:kMDItemWhereFroms"
) -> from_where

from_where
## [1] "bplist00\xa2\001\002_\020}https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdfP\b\v\x8b"

But it's in binary plist format (yay Apple #sigh). Since that's "a thing", the xattrs package has a read_bplist() function; it needs raw bytes, so we have to fetch the attribute with get_xattr_raw() instead of get_xattr():

xattrs::read_bplist(
  xattrs::get_xattr_raw(
    path = "~/Downloads/forso/0.-miljoenennota.pdf",
    name = "com.apple.metadata:kMDItemWhereFroms"
  )
) -> from_where

str(from_where)
## List of 1
##  $ plist:List of 1
##   ..$ array:List of 2
##   .. ..$ string:List of 1
##   .. .. ..$ : chr "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf"
##   .. ..$ string: list()
##   ..- attr(*, "version")= chr "1.0"

The ugly, nested list is the fault of the really dumb binary plist file format, but the source URL is in there.
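
The URL can be pulled out of that nested list without indexing every level by hand. This is a minimal sketch that assumes the list has the shape str() shows above; the list is rebuilt by hand here as a stand-in so the snippet runs without the xattrs package:

```r
# Hand-built stand-in with the same shape as the str() output above
from_where <- list(
  plist = structure(
    list(
      array = list(
        string = list("https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf"),
        string = list()
      )
    ),
    version = "1.0"
  )
)

# Flatten everything under `array` into a plain character vector
urls <- unlist(from_where$plist$array, use.names = FALSE)

# Drop empty strings, then take the first entry (NA if nothing is left)
urls <- urls[nzchar(urls)]
source_url <- if (length(urls) > 0) urls[[1]] else NA_character_

source_url
```

Flattening with unlist() avoids depending on how many "string" entries the array holds, which is an assumption about how other files' attributes may differ.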

We can get all of them this way by using lapply (I tossed a bunch of random, interactively downloaded PDFs into a directory for this). There's also an example of this in this blog post, but it uses reticulate and a Python package to read the binary plist data instead of the package's built-in function. That built-in function is a wrapper around the macOS plutil utility or the Linux plistutil utility; Windows users can switch to a real operating system if they want to use it.

fils <- list.files("~/Downloads/forso", pattern = "\\.pdf$", full.names = TRUE)

do.call(
  rbind.data.frame,
  lapply(fils, function(.x) {

    xattrs::read_bplist(
      xattrs::get_xattr_raw(
        path = .x,
        name = "com.apple.metadata:kMDItemWhereFroms"
      )
    ) -> tmp

    from_where <- if (length(tmp$plist$array$string) > 0) {
      tmp$plist$array$string[[1]]
    } else {
      NA_character_
    }

    data.frame(
      fil = basename(.x),
      url = from_where,
      stringsAsFactors=FALSE
    )

  })
) -> files_with_meta

str(files_with_meta)
## 'data.frame': 9 obs. of  2 variables:
##  $ fil: chr  "0.-miljoenennota.pdf" "19180242-D02E-47AC-BDB3-73C22D6E1FDB.pdf" "Codebook.pdf" "Elementary-Lunch-Menu.pdf" ...
##  $ url: chr  "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf" "http://eprint.ncl.ac.uk/file_store/production/230123/19180242-D02E-47AC-BDB3-73C22D6E1FDB.pdf" "http://apps.start.umd.edu/gtd/downloads/dataset/Codebook.pdf" "http://www.msad60.org/wp-content/uploads/2017/01/Elementary-February-Lunch-Menu.pdf" ...

NOTE: IRL you should likely do more bulletproofing in the example lapply.
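
One way to do that bulletproofing is to wrap the per-file read in tryCatch() so that a file without the attribute, or one whose plist cannot be parsed, yields NA instead of aborting the whole loop. This is only a sketch: safe_source_url() and its reader argument are hypothetical names, and whether the xattrs functions raise an error or return an empty value for a missing attribute has not been checked here, so the wrapper handles both. The reader is passed in as a function so the wrapper can be tried without xattrs installed:

```r
# Hypothetical wrapper: `reader` is any function that takes a path and returns
# the parsed plist list (for example, the read_bplist(get_xattr_raw(...)) call above).
safe_source_url <- function(path, reader) {
  tryCatch({
    tmp <- reader(path)
    urls <- unlist(tmp$plist$array, use.names = FALSE)
    urls <- urls[nzchar(urls)]
    if (length(urls) > 0) urls[[1]] else NA_character_
  }, error = function(e) NA_character_)  # any failure becomes NA
}

# Stand-in readers, for demonstration only
good_reader  <- function(path) {
  list(plist = list(array = list(string = list("http://example.com/a.pdf"))))
}
empty_reader <- function(path) list(plist = list(array = list(string = list())))
bad_reader   <- function(path) stop("attribute not found")

safe_source_url("a.pdf", good_reader)
safe_source_url("b.pdf", empty_reader)
safe_source_url("c.pdf", bad_reader)
```

In the lapply above, the body of the anonymous function could then call safe_source_url(.x, reader) with a reader that wraps the two xattrs calls.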

Upvotes: 5
