Reputation: 51
I have written a code to extract the required tables from text files. The code is as follows:
def find_no_line_start_table(table_title,splited_data):
found_no_lines = []
for index, line in enumerate(splited_data):
if table_title in line:
found_no_lines.append(index)
return found_no_lines[0]
def get_start_data_table(table_start, splited_data):
for index, row in enumerate(splited_data[table_start:]):
if 'TOTAL' in row:
return table_start + index
def get_end_table(start_table_data, splited_data ):
for index, row in enumerate(splited_data[start_table_data:]):
if END_TABLE_LINE in row:
return start_table_data + index
import requests
import pandas as pd
txt_data = requests.get('https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/vd66wc47s/sq87c742w/latest.txt'
).text
splited_data = txt_data.split('\n')
table_title = ' U.S. Rice Supply and Use 1/'
END_TABLE_LINE = 'LONG GRAIN RICE'
table_start = find_no_line_start_table(table_title,splited_data)
a = get_start_data_table(table_start, splited_data)
b = get_end_table(a, splited_data)
alpa = splited_data[a:b]
alpa
['TOTAL RICE\r',
'Area Planted 2.46 2.95 2.54 2.54\r',
'Area Harvested 2.37 2.92 2.48 2.48\r',
'\r',
' Pounds\r',
'Yield per Harvested Acre 7507 7692 7563 7616\r',
'\r',
' Million Hundredweight\r',
' Beginning Stocks 2/ 46.0 29.4 44.9 44.9\r',
' Production 178.2 224.2 187.3 188.6\r',
' Imports 27.5 29.0 29.6 29.6\r',
...
Is there a way to convert this alpha into a dataframe with 5 different columns, with first one as heading and remaining 4 columns containing values. I have tried different methods like str.split, converting into pd.dataframe. However, none seems to work. Any inputs will be helpful.
@0range, what if some variables have missing information.
txt_data = requests.get('https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/t722h923j/9593tv50h/wasde-05-11-1995.txt').text
splited_data = txt_data.split('\n')
table_title = ' U.S. Rice Supply and Use'
END_TABLE_LINE = 'LONG GRAIN'
table_start = find_no_line_start_table(table_title,splited_data)
a = get_start_data_table(table_start, splited_data)
g = get_end_table(a, splited_data)
alpa = splited_data[a:g]
Here using the dictionary method suggested by you, I am not able to properly extract the last variable i.e. price
Upvotes: 1
Views: 145
Reputation: 2158
So to restate your question in a more general way: If I understand correctly, you want to:
Dataframes can among other ways be generated from dicts where all values are lists of identical length; each of these will be transformed into a column. For this, you can first transform the list of strings into an appropriately formed dict. You will also have to clean the strings a bit before.
I will not run your entire code but start with the data structure you received at the end of your script.
import pandas as pd
""" Example data"""
alpa = ['TOTAL RICE\r',
'Area Planted 2.46 2.95 2.54 2.54\r',
'Area Harvested 2.37 2.92 2.48 2.48\r',
'\r',
' Pounds\r',
'Yield per Harvested Acre 7507 7692 7563 7616\r',
'\r',
' Million Hundredweight\r',
' Beginning Stocks 2/ 46.0 29.4 44.9 44.9\r',
' Production 178.2 224.2 187.3 188.6\r',
' Imports 27.5 29.0 29.6 29.6\r']
""" Prepare dict"""
alpadict = {}
""" Parse data"""
for i in range(len(alpa)):
""" Remove leading and trailing spaces, newlines, etc."""
alpa[i] = alpa[i].strip()
""" Remove long whitespaces. Only max 2 successive whitespaces remain."""
while " " in alpa[i]:
alpa[i] = alpa[i].replace(" ", " ")
""" Separate elements"""
alpa[i] = alpa[i].split(" ")
""" Record into dict (first element as key, others as values in lists)"""
alpadict[alpa[i][0]] = alpa[i][1:]
""" Add empty elements so that all dict entry lists are of equal length 4"""
while len(alpadict[alpa[i][0]]) < 4:
alpadict[alpa[i][0]].append("")
""" Transform to dataframe"""
df = pd.DataFrame.from_dict(alpadict)
The result is going to be:
print(df)
# TOTAL RICE Area Planted Area Harvested Pounds Yield per Harvested Acre Million Hundredweight Beginning Stocks 2/ Production Imports
#0 2.46 2.37 7507 46.0 178.2 27.5
#1 2.95 2.92 7692 29.4 224.2 29.0
#2 2.54 2.48 7563 44.9 187.3 29.6
#3 2.54 2.48 7616 44.9 188.6 29.6
Is this what you wanted?
(As you can see, you may have to deal with the labels that span multiple lines, e.g. "Pounds Yield per Harvested Acre".)
Edit: Dealing with missing values
In this case, I am afraid you will have to assess for each string of whitespaces if this is supposed to be only one field separator or two. Given that the values are nicely above one another in your example, you may be able to work with fixed positions in the string to separate the fields. E.g.:
""" Prepare dict"""
alpadict = {}
""" Parse data"""
for i in range(len(alpa)):
""" Remove leading and trailing spaces, newlines, etc."""
alpa[i] = alpa[i].strip()
""" Split strings on particular points."""
alpa[i] = [alpa[i][:30], alpa[i][30:43], alpa[i][43:57], alpa[i][57:70], alpa[i][70:]]
""" Remove leading and trailing whitespace from all elements"""
alpa[i] = [a.strip() for a in alpa[i]]
""" Record into dict (first element as key, others as values in lists)"""
alpadict[alpa[i][0]] = alpa[i][1:]
""" Transform to dataframe"""
df = pd.DataFrame.from_dict(alpadict)
Upvotes: 1