Pete Loudovaris
Pete Loudovaris

Reputation: 11

How would I import data like this into R or Python?

I wish to import the data found here https://www.asx.com.au/data/shortsell.txt, so that it becomes a 691x7 table. How to I tell it to recognise the different columns?

Thanks in advance

Upvotes: 1

Views: 66

Answers (1)

Timus
Timus

Reputation: 11321

You most likely have solved your problem by now. If not, here's a suggestion: You could write some regex to read the lines, but that's way too much hassle in my opinion. It seems to me that the column-defining feature is their sizes. So you have to count a bit and the rest is straightforward. Make a first count and then visualise your result by doing something like this (that's code you're going to need later anyway):

with open('shortsell.txt', 'r') as file:
    [next(file) for _ in range(5)]  # Skip the first 5 rows

    for line in file:
        print(line[:8] + '|' + line[8:42] + '|' + line[42:56] + '|'
              + line[56:71] + '|' + line[71:90] + '|' + line[90:].rstrip())

Once you have the right sizes:

with open('shortsell.txt', 'r') as file:
    [next(file) for _ in range(5)]  # Skip the first 5 rows

    # Read the columns parts
    columns = [[line[:8].strip(), line[8:42].strip(), line[42:56].strip(),
                line[56:71].strip(), line[71:90].strip(), line[90:].strip()]
               for line in (next(file), next(file), next(file))]
    # Join the parts
    columns = [' '.join(columns[i][j] for i in range(3)).strip() for j in
               range(6)]

    # Read the data and cast to fitting type
    data = [[line[:8].strip(),
             line[8:42].strip(),
             line[42:56].strip(),
             int(line[56:71].strip().replace(',', '')),
             int(line[71:90].strip().replace(',', '')),
             float(line[90:].strip().replace(',', ''))]
            for line in file]

Result:

['ASX Code',
 'Company Name',
 'Product/ Class',
 'Reported Gross Short Sales (a) ASX + CHI-X',
 'Issued Capital (b)',
 '% of issued capital reported as short sold  (a)/(b)']
[['360', 'LIFE360 INC.', 'CDI FORUS', 8999, 148866201, 0.0],
 ['3DP', 'POINTERRA LIMITED', 'FPO', 15213, 670733112, 0.0],
 ['4DS', '4DS MEMORY LIMITED', 'FPO', 15000, 1310693486, 0.0],
  ...
 ['ZEL', 'Z ENERGY LIMITED.', 'FPO NZX', 23255, 520476853, 0.0],
 ['ZLD', 'ZELIRA THERAPEUTICS LIMITED', 'FPO', 101860, 1185322966, 0.0],
 ['ZNO', 'ZOONO GROUP LIMITED', 'FPO', 67213, 163612707, 0.04]]

Another option would be to use Pandas' pd.read_fwf(), a function designed to read fixed-width files:

import pandas as pd

with open('shortsell.txt', 'r') as file:
    [next(file) for _ in range(5)]  # Skip the first 5 rows
    df = pd.read_fwf(file, header=None)
df.columns = (
    df.iloc[0:3, :]
    .fillna("").astype(str)
    .apply(lambda col: " ".join(col.str.strip()).strip(), axis=0)
)
data = df.iloc[3:, :].values.tolist()

Upvotes: 2

Related Questions