Reputation: 135
I have 32 separate HTML files, each containing data in a table-like format with 8 columns. Each file covers one species of fungus.
I need to convert the 32 HTML files into 32 CSV files. I have a script that works for a single file, but I can't figure out how to do this systematically for all 32 files with a few commands, instead of running the same command 32 times.
Here is the script I am using in an attempt to loop through all 32 files:
import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []  # shared by every file, so all species end up in one flat list
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # all rows of the first table, skipping the header row
        HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
        for element in HTML_data:
            sub_data = []
            for sub_element in element:
                try:
                    sub_data.append(sub_element.get_text())
                except:
                    continue
            data.append(sub_data)
data
Here is some output data from the script above simplified for replication purposes:
[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Kenya',
'Present',
'',
'Introduced',
'',
'',
'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Malawi, Ministry of Agriculture (1990)',
''],
['Mozambique',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
''],
['Nigeria',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
''],
['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Casulli (1979); Martin et al. (1997)',
''],
['Zambia',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
''],
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Ethiopia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Libya',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Morocco',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Mozambique',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['South Africa',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Sudan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
['Uganda',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['Afghanistan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Armenia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Azerbaijan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]
What I think I need is for every species to be formatted more like this: [[info_species1],[info_species1],[info_species1]], [[info_species2],[info_species2],[info_species2]]. In other words, in my output I need:
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
'']], # AN EXTRA SQUARE BRACKET RIGHT HERE
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
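The nesting described above mostly comes down to starting a fresh list of rows for each file instead of appending everything to one shared `data` list. A minimal sketch of that shape follows. The helper `collect_per_file`, the `parse_rows` callable, and the file names are hypothetical stand-ins, not part of the script above; `parse_rows` represents the BeautifulSoup parsing of a single file.

```python
def collect_per_file(filenames, parse_rows):
    """Return {filename: rows}, with a separate row list for every file.

    `parse_rows` is a hypothetical callable standing in for the
    BeautifulSoup parsing: it takes a filename and returns a list of rows.
    """
    rows_by_file = {}
    for filename in sorted(filenames):
        if filename.endswith('.html'):
            # a new list per file, so species are never mixed together
            rows_by_file[filename] = list(parse_rows(filename))
    return rows_by_file


# Stand-in "parser" with a few rows taken from the output above,
# used only to show the shape of the result.
fake_tables = {
    'species1.html': [['Kenya', 'Present'], ['Malawi', 'Present']],
    'species2.html': [['Egypt', 'Present']],
    'notes.txt': [],
}
result = collect_per_file(fake_tables, fake_tables.get)
print(result)
```

Keying the result by filename, rather than nesting anonymous lists, keeps it obvious which rows belong to which species when each group is later written to its own CSV file.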
Upvotes: 2
Views: 693
Reputation: 28565
Have you considered just reading in the table tags with pandas?
import os
from io import StringIO

import pandas as pd

directory = r'../html/species'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        csv_filename = filename.replace('.html', '.csv')
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # read_html returns a list of DataFrames; take the first table
            table = pd.read_html(StringIO(f.read()))[0]
        table.to_csv(csv_filename, index=False)
        print(table)
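If you stay with the BeautifulSoup rows from the question instead, note that the region names (Africa, Asia, South America) arrive as single-cell rows padded with whitespace, which would not line up with the 8 data columns in a CSV. One possible way to handle that, sketched below under the assumption that every single-cell row is a region header, is to fold the region into a leading column. The helper names `fold_region_rows` and `rows_to_csv_text` are made up for this illustration.

```python
import csv
import io


def fold_region_rows(rows):
    """Turn single-cell region rows into a leading 'region' column.

    Assumes (not confirmed by the source) that any row with exactly one
    cell is a region header applying to the rows that follow it.
    """
    region = ''
    cleaned = []
    for row in rows:
        if len(row) == 1:
            region = row[0].strip()  # e.g. '\n\t\tAsia\n\t' -> 'Asia'
        else:
            cleaned.append([region] + [cell.strip() for cell in row])
    return cleaned


def rows_to_csv_text(rows):
    """Serialize rows to CSV text in memory (no file is written here)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerows(rows)
    return buffer.getvalue()


# Rows copied from the output shown in the question.
sample = [
    ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
    ['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
    ['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
    ['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
]
cleaned = fold_region_rows(sample)
csv_text = rows_to_csv_text(cleaned)
print(csv_text)
```

To write real files, the same rows could be passed to `csv.writer` on a file opened with `newline=''`, once per species.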
Upvotes: 1