Reputation: 1
I read tables from a PDF with tabula-py using the following code:
table = tabula.read_pdf(files[0], pages='all', multiple_tables=True, stream=True)
Sometimes the values from two adjacent columns are joined into a single column (separated by a single space), which pushes the remaining values in that row to the left and leaves NA at the end. For example:
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
a1 | b1 c1 | d1 | e1 f1 | g1 | h1 | NA | NA |
a2 | b2 | c2 | d2 | e2 | f2 | g2 | h2 |
How can I move the values back into the correct columns, to get:
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
a1 | b1 | c1 | d1 | e1 | f1 | g1 | h1 |
a2 | b2 | c2 | d2 | e2 | f2 | g2 | h2 |
Upvotes: 0
Views: 65
Reputation: 31146
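import io
import pandas as pd

# Sample frame reproducing the problem; the columns are tab-separated (\t)
df = pd.read_csv(io.StringIO("""col0\tcol1\tcol2\tcol3\tcol4\tcol5\tcol6\tcol7
a1\tb1 c1\td1\te1 f1\tg1\th1\tNA\tNA
a2\tb2\tc2\td2\te2\tf2\tg2\th2"""), sep="\t")
# Write the frame space-delimited, strip the quotes around merged cells, then re-parse on whitespace
df = pd.read_csv(io.StringIO(df.to_csv(sep=" ").replace("\"", "")), sep=r"\s+")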
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
a1 | b1 | c1 | d1 | e1 | f1 | g1 | h1 |
a2 | b2 | c2 | d2 | e2 | f2 | g2 | h2 |
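The CSV round trip only works if no legitimate cell contains a space. The same idea can also be applied row by row, which makes it easier to catch rows that end up with too many values. This is only a minimal sketch: `realign_row` and `realign` are hypothetical helper names, and it assumes that every space inside a cell marks a merged column and that the only missing values are the padding at the end of a row (a genuinely empty cell in the middle would shift the values after it to the left).

```python
import pandas as pd


def realign_row(row: pd.Series) -> pd.Series:
    """Split every cell on whitespace, flatten, and pad back to the row's width."""
    tokens = []
    for value in row:
        if pd.isna(value):
            continue  # skip the NaN padding left at the end of a shifted row
        tokens.extend(str(value).split())
    if len(tokens) > len(row):
        # More values than columns: a cell probably contained a real space
        raise ValueError(f"row has {len(tokens)} values for {len(row)} columns")
    tokens += [pd.NA] * (len(row) - len(tokens))
    return pd.Series(tokens, index=row.index)


def realign(df: pd.DataFrame) -> pd.DataFrame:
    """Apply realign_row to every row, keeping the original column labels."""
    return df.apply(realign_row, axis=1)


# Same sample data as in the question
df = pd.DataFrame(
    [
        ["a1", "b1 c1", "d1", "e1 f1", "g1", "h1", None, None],
        ["a2", "b2", "c2", "d2", "e2", "f2", "g2", "h2"],
    ],
    columns=[f"col{i}" for i in range(8)],
)
fixed = realign(df)
print(fixed)
```

Since `tabula.read_pdf` with `multiple_tables = True` returns a list of DataFrames, the helper would be applied to each table in turn, for example `[realign(t) for t in table]`.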
Upvotes: 2
Reputation: 544
Could you try adding guess = False, so that tabula does not guess the table area on its own:
table = tabula.read_pdf(files[0], pages='all', multiple_tables=True, guess=False, stream=True)
Upvotes: 0