Reputation: 67
I read a PDF file with PDFMiner and get a string with the following structure:
text
text
text
col1
1
2
3
4
5
col2
(1)
(2)
(3)
(7)
(4)
col3
name1
name2
name3
name4
name5
col4
name
5
45
7
87
8
col5
FAE
EFD
SDE
FEF
RGE
col6
name
45
7
54
4
130
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
7
1
8
text1
text1
text1
col1
6
7
8
9
10
col2
(1)
(2)
(3)
(7)
(4)
col3
name6
name7
name8
name9
name10
col4
name
54
4
78
8
86
col5
SDE
FFF
EEF
GFE
JHG
col6
name
6
65
65
45
78
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
4
1
54
I have 10 columns named col1, col2, col3, col4 name, col5, col6 name, # col7, col8, col9, and col10 name. Because those 10 columns appear on every page, the structure is repeated for each page. The names are always the same on every page. I am not sure how to pull everything into a single dataframe. For example, for col1 the dataframe would contain:
1
2
3
4
5
6
7
8
9
10
I also have some empty columns (col8 in my example) and I am not sure how to deal with them.
Any ideas? Thanks!
Upvotes: 0
Views: 58
Reputation: 195573
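You can parse the document with a regular expression (regex101). For example, with txt being the string from your question:
import re
import pandas as pd

# Map each column header to the values collected from every page.
d = {}
for col_name, cols in re.findall(r'\n^((?:#\s)?col\d+(?:\n\s*name\n+)?)(.*?)(?=\n\n|^(?:#\s)?col\d+|\Z)', txt, flags=re.M|re.S):
    d.setdefault(col_name.strip(), []).extend(cols.strip().split('\n'))

# Columns can differ in length, so build row-wise and transpose; short columns are padded with None.
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)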
Prints:
   col1  col2    col3  col4\n name  col5  col6\n name  # col7  col8  col9  col10\nname
0     1   (1)   name1            5   FAE           45      16          55            1
1     2   (2)   name2           45   EFD            7      18          30            7
2     3   (3)   name3            7   SDE           54      22  None    60            1
3     4   (7)   name4           87   FEF            4      17  None     1            8
4     5   (4)   name5            8   RGE          130      25  None   185            1
5     6   (1)   name6           54   SDE            6      16  None    55            4
6     7   (2)   name7            4   FFF           65      18  None    30            1
7     8   (3)   name8           78   EEF           65      22  None    60           54
8     9   (7)   name9            8   GFE           45      17  None     1         None
9    10   (4)  name10           86   JHG           78      25  None   185         None
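Note that in the output above col10 has only four values per page, so the second page's values slide up into first-page rows, and col8 holds empty strings in its first two rows. If row alignment matters, one possible alternative is to parse page by page and pad each page's short columns before stacking. The following is only a minimal sketch, not a drop-in replacement for the regex: the helper names (parse_pages, pages_to_rows, to_dataframe) and the SAMPLE text are made up for illustration, and it assumes that a blank line ends each column block, that a line reading exactly "name" right after a header belongs to the header, and that a repeated header marks a new page. Padding at the bottom of a short column is also a guess, since the text alone does not say which row is actually missing.

```python
import re

# Header lines look like "col1" or "# col7" (pattern taken from the data above).
HEADER = re.compile(r"^(?:#\s)?col\d+$")

def parse_pages(text):
    """Return one {column: [values]} dict per page (layout assumptions in the lead-in)."""
    pages, page = [], {}
    values = None           # list being filled, or None when outside a column block
    after_header = False    # True only for the line right after a header
    for raw in text.splitlines():
        line = raw.strip()
        if HEADER.match(line):
            if line in page:            # header seen before: a new page starts
                pages.append(page)
                page = {}
            values = page.setdefault(line, [])
            after_header = True
            continue
        if not line:                    # assumed: a blank line closes the block
            values = None
        elif values is not None and not (after_header and line == "name"):
            values.append(line)         # "name" right after a header is skipped
        after_header = False
    if page:
        pages.append(page)
    return pages

def pages_to_rows(pages):
    """Pad every column of a page to the page's longest column, then emit rows."""
    rows = []
    for page in pages:
        height = max((len(v) for v in page.values()), default=0)
        for i in range(height):
            rows.append({col: (vals[i] if i < len(vals) else None)
                         for col, vals in page.items()})
    return rows

def to_dataframe(text):
    import pandas as pd  # imported here so the parsing helpers need only the stdlib
    return pd.DataFrame(pages_to_rows(parse_pages(text)))

# Hypothetical two-page sample in the assumed layout (not the real PDF text).
SAMPLE = """\
text

col1
1
2

col8

col10
name
1

text1

col1
3
4

col8

col10
name
7
8
"""

rows = pages_to_rows(parse_pages(SAMPLE))
```

With this layout, to_dataframe(txt) would give one row per table row, with None in every cell of an empty column such as col8, and the padding for a short column stays inside its own page. The column names lose the "name" suffix in this sketch; rename them afterwards if you need it.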
Upvotes: 3