d9120
d9120

Reputation: 511

How should I parse this data?

I'm currently importing some csv and parsing the data using python. One column in particular has different formats and I'm not exactly sure which approach I should use to parse the data.

The data in that column can have the following formats

5.00 LBS
5.00LBS
5.00lbs
5LBS
5 LBS
5 kg

or any other unit

What I would like to do is store the number in one variable and the unit in another so I can later use them in my Django Model and create an object.

import csv
def import_measurements(request):
    path = 'measurements.csv'
    with open(path) as f:
        reader = csv.reader(f)
        for row in reader:
            # unit = row[0] string portion
            # weight =  row[0] number portion
            created = Measurements.objects.create(
                           unit=unit,
                           weight=weight
            )

Any help would be appreciated

Upvotes: 0

Views: 43

Answers (2)

Hackaholic
Hackaholic

Reputation: 19733

you can use regx:

>>> doc
'5.00 LBS\n5.00LBS\n5.00lbs\n5LBS\n5 LBS\n5 kg'
>>> re.findall('(\d+\.?\d*)\s*([a-zA-Z]+)',doc)
[('5.00', 'LBS'), ('5.00', 'LBS'), ('5.00', 'lbs'), ('5', 'LBS'), ('5', 'LBS'), ('5', 'kg')]

Upvotes: 1

njzk2
njzk2

Reputation: 39386

Apparently your value is composed of 2 parts: a number (decimal or not), and a unit (lbs or kg), separated by an optional space.

I would use a regex in the form:

(\d(?:\.\d{2})?)\s?((?:lbs)|(?:kg))

Using the i flag to ignore the case.

The first group would contain 5 or 5.00, and the second, the unit. Lowercase the unit, and there you have it.

Left as an exercise to the reader the possibility to handle various decimal separators (, for example), and to support thousand separators.

A more relaxed but therefore less strict expression could be:

([0-9.]+)\s?(\w+)

Upvotes: 1

Related Questions