Raghav Goyal
Raghav Goyal

Reputation: 51

Problem in splitting a string list into different columns

I have written a code to extract the required tables from text files. The code is as follows:

def find_no_line_start_table(table_title,splited_data):
    found_no_lines = []
    for index, line in enumerate(splited_data):
        if table_title in line:
            found_no_lines.append(index)

    return found_no_lines[0]

def get_start_data_table(table_start, splited_data):
    for index, row in enumerate(splited_data[table_start:]):
        if 'TOTAL' in row:
            return table_start + index

def get_end_table(start_table_data, splited_data ):
    for index, row in enumerate(splited_data[start_table_data:]):
            if END_TABLE_LINE in row:
                return start_table_data + index

import requests
import pandas as pd
txt_data = requests.get('https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/vd66wc47s/sq87c742w/latest.txt'
).text
splited_data = txt_data.split('\n')
table_title = ' U.S. Rice Supply and Use  1/'
END_TABLE_LINE = 'LONG GRAIN RICE'
table_start = find_no_line_start_table(table_title,splited_data)
a = get_start_data_table(table_start, splited_data)
b = get_end_table(a, splited_data)
alpa = splited_data[a:b]
alpa
['TOTAL RICE\r',
 'Area Planted                      2.46          2.95          2.54          2.54\r',
 'Area Harvested                    2.37          2.92          2.48          2.48\r',
 '\r',
 '                                                  Pounds\r',
 'Yield per Harvested Acre          7507          7692          7563          7616\r',
 '\r',
 '                                               Million Hundredweight\r',
 '  Beginning Stocks 2/             46.0          29.4          44.9          44.9\r',
 '  Production                     178.2         224.2         187.3         188.6\r',
 '  Imports                         27.5          29.0          29.6          29.6\r',
...

Is there a way to convert this alpha into a dataframe with 5 different columns, with first one as heading and remaining 4 columns containing values. I have tried different methods like str.split, converting into pd.dataframe. However, none seems to work. Any inputs will be helpful.

@0range, what if some variables have missing information.

txt_data = requests.get('https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/t722h923j/9593tv50h/wasde-05-11-1995.txt').text
splited_data = txt_data.split('\n')
table_title = ' U.S. Rice Supply and Use'
END_TABLE_LINE = 'LONG GRAIN'
table_start = find_no_line_start_table(table_title,splited_data)
a = get_start_data_table(table_start, splited_data)
g = get_end_table(a, splited_data)
alpa = splited_data[a:g] 

Here using the dictionary method suggested by you, I am not able to properly extract the last variable i.e. price

Upvotes: 1

Views: 145

Answers (1)

0range
0range

Reputation: 2158

So to restate your question in a more general way: If I understand correctly, you want to:

  • transform an array of strings into a dataframe
  • where the strings contain values separated by a number of spaces (at least two between each two values)
  • such that the lines are the columns
  • with the first element in each line being the column namedtuple
  • and the remaining elements (4 for each line) being the values for the rows for this column in order.

Dataframes can among other ways be generated from dicts where all values are lists of identical length; each of these will be transformed into a column. For this, you can first transform the list of strings into an appropriately formed dict. You will also have to clean the strings a bit before.

I will not run your entire code but start with the data structure you received at the end of your script.

import pandas as pd

""" Example data"""
alpa = ['TOTAL RICE\r',
 'Area Planted                      2.46          2.95          2.54          2.54\r',
 'Area Harvested                    2.37          2.92          2.48          2.48\r',
 '\r',
 '                                                  Pounds\r',
 'Yield per Harvested Acre          7507          7692          7563          7616\r',
 '\r',
 '                                               Million Hundredweight\r',
 '  Beginning Stocks 2/             46.0          29.4          44.9          44.9\r',
 '  Production                     178.2         224.2         187.3         188.6\r',
 '  Imports                         27.5          29.0          29.6          29.6\r']

""" Prepare dict"""
alpadict = {}

""" Parse data"""
for i in range(len(alpa)):
    """ Remove leading and trailing spaces, newlines, etc."""
    alpa[i] = alpa[i].strip()
    
    """ Remove long whitespaces. Only max 2 successive whitespaces remain."""
    while "   " in alpa[i]:
        alpa[i] = alpa[i].replace("   ", "  ")
    
    """ Separate elements"""
    alpa[i] = alpa[i].split("  ")
    
    """ Record into dict (first element as key, others as values in lists)"""
    alpadict[alpa[i][0]] = alpa[i][1:]
    
    """ Add empty elements so that all dict entry lists are of equal length 4"""
    while len(alpadict[alpa[i][0]]) < 4:
        alpadict[alpa[i][0]].append("")
        
""" Transform to dataframe"""
df = pd.DataFrame.from_dict(alpadict)

The result is going to be:

print(df)

#  TOTAL RICE Area Planted Area Harvested   Pounds Yield per Harvested Acre Million Hundredweight Beginning Stocks 2/ Production Imports
#0                    2.46           2.37                              7507                                      46.0      178.2    27.5
#1                    2.95           2.92                              7692                                      29.4      224.2    29.0
#2                    2.54           2.48                              7563                                      44.9      187.3    29.6
#3                    2.54           2.48                              7616                                      44.9      188.6    29.6

Is this what you wanted?

(As you can see, you may have to deal with the labels that span multiple lines, e.g. "Pounds Yield per Harvested Acre".)

Edit: Dealing with missing values

In this case, I am afraid you will have to assess for each string of whitespaces if this is supposed to be only one field separator or two. Given that the values are nicely above one another in your example, you may be able to work with fixed positions in the string to separate the fields. E.g.:

""" Prepare dict"""
alpadict = {}

""" Parse data"""
for i in range(len(alpa)):
    """ Remove leading and trailing spaces, newlines, etc."""
    alpa[i] = alpa[i].strip()
    
    """ Split strings on particular points."""
    alpa[i] = [alpa[i][:30], alpa[i][30:43], alpa[i][43:57], alpa[i][57:70], alpa[i][70:]]
    
    """ Remove leading and trailing whitespace from all elements"""
    alpa[i] = [a.strip() for a in alpa[i]]
        
    """ Record into dict (first element as key, others as values in lists)"""
    alpadict[alpa[i][0]] = alpa[i][1:]
            
""" Transform to dataframe"""
df = pd.DataFrame.from_dict(alpadict)

Upvotes: 1

Related Questions