Reputation: 43
I have two .pdf files that have a table inside with buy and sell stock information and a date on top right corner header of each page. See the files here. If necessary save the two .pdf files and the script below into the same folder in your computer and run the script to reproduce the problem.
I want to extract just the table content from each file, join and transform it into a tibble and insert one first column (into a tibble) with dates extracted from header files.
So, if the first 5 lines in the tibble come from the first pdf file than the first 5 lines in the first column had to be filled in with the same date extract from the header of first file. If the next 2 lines after previous 5 ones come from the second file than these 2 lines in the first column had to be filled in with the same date extract from the header of the second file.
I've already extracted the table from each files, join and create a tibble as you can see below. Even create a code to extract the dates. But really, I don't know how to associate the date extracted from header to a table content of each file and insert it into the tibble.
Code - Extract Table Information
## EXTRACT PDF FILE INFORMATION AND GENERATE A CLEAN DATASET
# load library
library(pdftools)
library(tidyverse)
# create a list with all file names
file_names <- dir(pattern = 'N.*')
# extract text from each file and append into a list
text_raw <- list()
for (i in 1:length(file_names)) {
doc <- pdf_text(file_names[i])
text_raw <- append(text_raw, doc)
}
# clean data
text_clean <- text_raw %>%
str_split('\r\n') %>%
unlist() %>%
as.vector() %>%
str_to_lower() %>%
str_squish() %>%
str_subset('1-bovespa') %>%
str_replace('1-', '') %>%
str_remove_all('#2?|on|nm|sa') %>%
str_squish()
# convert as tibble
df <- tbl_df(text_clean)
# split column
df <- separate(df,
value,
c('c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8'),
sep = ' ')
print(df)
Code - Extract Dates
# filter dates
dates <- text_raw %>%
str_split('\r\n') %>%
unlist() %>%
as.vector() %>%
str_squish() %>%
str_subset('\\d{4}\\s\\d{1}\\s\\d{2}\\/\\d{2}\\/\\d{4}$') %>%
str_remove_all('(\\d+\\s\\d{1}\\s)')
print(dates)
c1 c2 c3 c4 c5 c6 c7 c8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 bovespa c vista cielo 800 10,79 8.632,00 d
2 bovespa c vista cielo 200 10,79 2.158,00 d
3 bovespa c vista brf 400 23,81 9.524,00 d
4 bovespa c vista brf 100 23,81 2.381,00 d
c1 c2 c3 c4 c5 c6 c7 c8 c9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 10/01/2019 bovespa c vista cielo 800 10,79 8.632,00 d
2 10/01/2019 bovespa c vista cielo 200 10,79 2.158,00 d
3 18/01/2019 bovespa c vista brf 400 23,81 9.524,00 d
4 18/01/2019 bovespa c vista brf 100 23,81 2.381,00 d
Any help?
Upvotes: 1
Views: 359
Reputation: 98
df$c0=dates
print(df)
hi,I am a chinese.
you should just rename the colname :
colnames(df)=c("c2","c3","c4","c5","c6","c7","c8","c9")
df$c1=dates
print(df)
Upvotes: -1
Reputation: 263481
I thought the effort to extract dates was unnecessarily complex, not to mention the fact that it appears to have worked for some of us but failed for my running of the code. Instead I constructed a date-pattern and extracted with stringi::stri_extract
:
stringi::stri_extract( regex="[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)
[1] "18/01/2019" # this pattern designed for this century dates in the DD/MM/YYYY format
dates <- stringi::stri_extract( regex="[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)
df$C9 <- dates
Furthermore, since there were multiple matches for the date pattern in each pdf, it would be safer to do the extraction before appending the text together and then you could use only the first values.
Upvotes: 0