vinu
vinu

Reputation: 35

python find and match exact string

How can I match a string from a column in a .csv file against a list of defined datatypes?

The list of data types is:

Datatype = ["M", "B", "E", "A", "DF", "DW", "DL", "DT", "XTEXT", "MDEDATA"]

I want to use the data type that matches the column to look up a value in a dictionary.

What works

When my code finds something like DW1 in the input (see below for examples) it writes DW:5 to the output file, and when it finds DT34 it writes DT:7. This works fine.

What doesn't work

But the problem is, whenever my code comes across XTEXT it matches it first to E: 2 (because E is present in XTEXT) and then to XTEXT: 4105 in the next row. As a result, the corresponding line in the output file contains 2 instead of 4105. Something similar happens for MDEDATA which is resolved to A:3.

I tried using a regex to remove numbers from the relevant columns in the input file (this line is commented out in the code above), like this:

if (Datatype[n] == re.sub('[1234567890\n]', '', line.split(";")[1])):
    # ...

but this causes problems when encountering strings like X256DATA.

Another issue is that the output file that is produced currently displays abc;DW1 in the first column and 5 in the second column, but I want abc in the first column, DW1 in the second and 5 in the third column!

How can I solve these problems? Any help appreciated.


CSV example

Input:

1 abc DW1  
2 uz5 XTEXT

Current output:

1 abc DW1    5
2 uz5 XTEXT  4105

TXT example

Input:

abc;DW1\n  
uz5;XTEXT\n

Current output:

abc;DW1,5\n  
uz5;XTEXT,4105\n    

Desired output

abc;DW1;5\n  
uz5;XTEXT;4105\n

Upvotes: 1

Views: 2431

Answers (3)

itsjeyd
itsjeyd

Reputation: 5280

OK, here is a solution based on your code that should work for the format of your .txt files:

import csv

# Define data_type / dict1 / dict2:

data_type = [
    "M", "E", "B", "A", "DF", "DW", "DL", "DT", "PF", "PW", "PL", "PT",
    "XTEXT", "MDEDATA", "X8DATA", "X16DATA", "X32DATA", "X64DATA",
    "X128DATA", "X256DATA", "X512DATA", "X1024DATA", "X2048DATA",
    "X4096DATA", "X8912DATA"]

dict1 = {
    "M": 1, "E": 2, "A": 3, "DF": 4, "DW": 5, "DL": 6, "DT": 7, "PF": 8,
    "PW": 9, "B": 10, "PL": 11, "PT": 12, "XTEXT": 4105, "MDEDATA": 4110,
    "X8DATA": 10000, "X16DATA": 10001, "X32DATA": 10002, "X64DATA": 10003,
    "X128DATA": 10004, "X256DATA": 10005, "X512DATA": 10006,
    "X1024DATA": 10007, "X2048DATA": 10008, "X4096DATA": 10009,
    "X8912DATA": 10010}

dict2 = {
    "M": "B", "E": "B", "A": "B", "DF": ">f","PF": ">f", "DW": ">h",
    "PW": ">h", "DL": ">l", "PL": ">l", "DT": "String", "PT": "String",
    "B": "B", "XTEXT": "XTEXT", "MDEDATA": "MDEDATA", "X8DATA": "X8DATA",
    "X16DATA": "X16DATA", "X32DATA": "X32DATA", "X64DATA": "X64DATA",
    "X128DATA": "X128DATA", "X256DATA": "X256DATA", "X512DATA": "X512DATA",
    "X1024DATA": "X1024DATA", "X2048DATA": "X2048DATA",
    "X4096DATA": "X4096DATA", "X8912DATA": "X8912DATA"}

# Obtain writer for output file:

source = 'GSV.txt'
dest = open('GSV_copy.txt', 'w')

# Process data:

import re

source_lines = open(source).readlines()
for line in source_lines:
    x = line.split(';')[1]
    icon_lletter = ''
    icon_lnummer = 0
    python_letter = ''
    for t in data_type:
        if x.startswith(t): # Alternative using re: "if re.match(t, x)"
            icon_lletter = t
            break
    if (icon_lletter not in dict1):
        print "Entry (%s) in Dictionary1 is not available" % (icon_lletter)
    else:
        icon_lnummer = dict1[icon_lletter]
    if (icon_lletter not in dict2):
        print "Entry (%s) in Dictionary2 is not available" % (icon_lletter)
    else:
        python_letter = dict2[icon_lletter]
        print "Converted the GVR datatype to python datatype is %s : %s : %s" \
            % (icon_lletter, python_letter, icon_lnummer)
        # Write line to output file:
        dest.write(';'.join([line.strip('\n'), str(icon_lnummer) + '\n']))

dest.close()

Please note:

  • With this solution I am trying to fix your problem while staying close to your original implementation. There are different ways of achieving your goal that are probably more Pythonic / efficient / DRY.

  • Since your input and ouput files do not have the same number of columns and differ in terms of formatting, you will eventually want to turn the code into a (set of) function(s) that take the things that differ (file type and column to check) as parameters.


Edit:

Working on this some more I noticed that the output in dest was a bit garbled. (I hadn't gotten that far before :)) I've changed the code to produce the exact output you want now.


Another Edit (based on discussion in comments):

On the topic of matching data types in the input file(s) against the list of defined data_types: If the DTs in your input file always have the numbers that you don't want to consider at the beginning or at the end, you could use

re.sub('^[0-9]+', '', x)

or

re.sub('[0-9]+$', '', x)

to remove them.

Upvotes: 1

vinu
vinu

Reputation: 35

As suggested by itsjeyd,

re.sub('^[0-9]+', '', x) or re.sub('[0-9]+$', '', x) does the trick!

Upvotes: 0

Janne Karila
Janne Karila

Reputation: 25207

Normal dictionary lookup matches the key exactly. If you get multiple matches you are doing something else.

>>> dict1 = {"M":1, "E":2, "A":3, "DF":4, "DW":5, "DL":6, "DT":7, "B":10, "XTEXT":4105, "MDEDATA":4110}
>>> dict1["XTEXT"]
4105

Now from the updated question I see you are not really looking for exact matches, as DW1 in file should match DW in Datatype. If you want to check if the beginning of the string matches, use str.startswith:

>>> "DW1".startswith("DW")
True

Note that you must check MDEDATA before M to avoid a false match. More generally, check longer matches before shorter ones.

Upvotes: 1

Related Questions