Uzi

Reputation: 453

Substring with inconsistent length

I have a fairly long string that contains several pieces of information. In effect it's a set of fields concatenated together without any delimiters. I understand that for this to work, every field should have a fixed length. However, two of the fields represent a name and an amount, and no padding (prefixes/suffixes) was used to keep them at a fixed length.

How would I go about this problem? Here's a sample of the string and how it should be separated:


Sample #1

Actual Input:
48001MCAbastillas2200800046300017100518110555130000123

How it should be separated:
480 | 01 | MCAbastillas | 2200800046300017 | 100518 | 110555 | 130000 | 123


Sample #2

Actual Input:
48004MCAbastillas22008000463000171005181105555000000123

How it should be separated:
480 | 04 | MCAbastillas | 2200800046300017 | 100518 | 110555 | 5000000 | 123

In my examples only the amount changes in length, but I expect the name to vary in length as well. Any suggestions would be much appreciated.

Upvotes: 1

Views: 186

Answers (2)

Calaf

Reputation: 1173

We want the output

480 | 01 | MCAbastillas | 2200800046300017 | 100518 | 110555 | 130000 | 123

where fields 3 and 7 have no fixed length. Suppose we store the input in a String variable:

String s="48001MCAbastillas2200800046300017100518110555130000123";

We can find fields 1 and 2 easily:

System.out.println(s.substring(0, 3)); // field 1: 3 digits
System.out.println(s.substring(3, 5)); // field 2: 2 digits
// drop the part we have already consumed
s = s.substring(5); // removes the characters at indices 0 to 4

If you call System.out.println(s); you will see

MCAbastillas2200800046300017100518110555130000123

Now the string starts with the name. Assuming the name contains no digits, it ends at the first digit, which we can find with a loop:

int index = -1;

for (int i = 0; i < s.length(); i++) {
    if (Character.isDigit(s.charAt(i))) {
        index = i;
        System.out.println("There is a number in the position " + index);
        break;
    }
}

Now you can extract the name and remove it from s:

    System.out.println(s.substring(0, index));
    s = s.substring(index);

and extract the next 3 fields, which have fixed lengths of 16, 6 and 6 (you can optimize this part...)

    System.out.println(s.substring(0, 16));
    s = s.substring(16);

    System.out.println(s.substring(0, 6));
    s = s.substring(6);

    System.out.println(s.substring(0, 6));
    s = s.substring(6);

Finally, you can divide the remaining s into two parts, of lengths s.length()-3 and 3:

    System.out.println(s.substring(0, s.length() - 3));
    System.out.println(s.substring(s.length() - 3));

Your output will be:

480

01

There is a number in the position 12

MCAbastillas

2200800046300017

100518

110555

130000

123
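
The steps above can be collected into one method. This is only a minimal sketch: the class and method names are made up for illustration, it assumes the name contains no digits, and it does no validation, so a malformed record will throw a StringIndexOutOfBoundsException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the class and method names are illustrative only.
class ManualRecordParser {

    // Splits a record into its eight fields, assuming the name contains no
    // digits and every field except the name and the amount has a fixed width.
    static List<String> parse(String s) {
        List<String> fields = new ArrayList<>();
        fields.add(s.substring(0, 3)); // field 1: 3 digits
        fields.add(s.substring(3, 5)); // field 2: 2 digits
        int pos = 5;

        // The name runs from position 5 up to the first digit.
        int end = pos;
        while (end < s.length() && !Character.isDigit(s.charAt(end))) {
            end++;
        }
        fields.add(s.substring(pos, end));
        pos = end;

        // Three fixed-width numeric fields of 16, 6 and 6 characters.
        for (int width : new int[] {16, 6, 6}) {
            fields.add(s.substring(pos, pos + width));
            pos += width;
        }

        // The last 3 characters are the final field; what precedes them is the amount.
        fields.add(s.substring(pos, s.length() - 3));
        fields.add(s.substring(s.length() - 3));
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ",
                parse("48001MCAbastillas2200800046300017100518110555130000123")));
        System.out.println(String.join(" | ",
                parse("48004MCAbastillas22008000463000171005181105555000000123")));
    }
}
```

Tracking a position instead of repeatedly reassigning s avoids creating a new string at every step, and it keeps the field widths in one place.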

Upvotes: 0

Michael

Reputation: 44150

I'd probably use a regular expression for this.

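import java.util.regex.Matcher;
import java.util.regex.Pattern;

String test = "48004MCAbastillas22008000463000171005181105555000000123";
// 3 digits, 2 digits, letters (name), 16 digits, 6 digits, 6 digits, digits (amount), last 3 digits
Pattern pattern = Pattern.compile("^(\\d{3})(\\d{2})([A-Za-z]+)(\\d{16})(\\d{6})(\\d{6})(\\d+)(\\d{3})$");
Matcher matcher = pattern.matcher(test);
if (matcher.matches())
{
    // group 0 is the whole match, so the fields start at group 1
    for (int i = 1; i <= matcher.groupCount(); ++i)
    {
        System.out.print(matcher.group(i) + " | ");
    }
}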

Sample output:

480 | 04 | MCAbastillas | 2200800046300017 | 100518 | 110555 | 5000000 | 123 |

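Note that the third group (the name) and the second-to-last group (the amount) do not have fixed lengths. The amount group is greedy, but the regex engine backtracks so that the final group still gets the last three digits.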
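
If you only care about the two variable-length fields, named groups make them easier to pick out. A rough sketch, using the same pattern as above with two groups named; the group names "name" and "amount" and the class and method names are my own choices for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class NamedGroupsDemo {
    // Same pattern as above; only the two variable-length groups are named.
    static final Pattern RECORD = Pattern.compile(
            "^(\\d{3})(\\d{2})(?<name>[A-Za-z]+)(\\d{16})(\\d{6})(\\d{6})(?<amount>\\d+)(\\d{3})$");

    // Returns {name, amount}, or null when the record does not match.
    static String[] nameAndAmount(String record) {
        Matcher matcher = RECORD.matcher(record);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] {matcher.group("name"), matcher.group("amount")};
    }

    public static void main(String[] args) {
        String[] result = nameAndAmount("48004MCAbastillas22008000463000171005181105555000000123");
        System.out.println(result[0] + " | " + result[1]);
    }
}
```

Named groups require Java 7 or later. The unnamed groups are still numbered as before, so matcher.group(1) and friends keep working.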

It's more difficult if the name can contain digits. My approach would be to run this against the data that you have and print a list of anything that doesn't match (i.e. add an else clause). Perhaps then you can come up with a better strategy for handling these cases. For example, something like ([A-Za-z]+\w*[A-Za-z]+) might be an improvement, because that will at least allow digits in the middle of the name. Bear in mind that \w also matches underscores, and that this form requires the name to have at least two letters.
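
Something along these lines could serve as that audit step. It's only a sketch: the class and method names are invented, and the third record in main is a made-up input (a digit inside the name) purely to show a non-match, not real data from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class RecordAudit {
    // The pattern from the answer above, which assumes a letters-only name.
    static final Pattern RECORD = Pattern.compile(
            "^(\\d{3})(\\d{2})([A-Za-z]+)(\\d{16})(\\d{6})(\\d{6})(\\d+)(\\d{3})$");

    // Collects every record the pattern cannot split, for manual review.
    static List<String> findUnmatched(List<String> records) {
        List<String> unmatched = new ArrayList<>();
        for (String record : records) {
            if (!RECORD.matcher(record).matches()) {
                unmatched.add(record);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "48001MCAbastillas2200800046300017100518110555130000123",
                "48004MCAbastillas22008000463000171005181105555000000123",
                // Made-up record with a digit inside the name.
                "48001MCAb4stillas2200800046300017100518110555130000123");
        for (String bad : findUnmatched(records)) {
            System.out.println("No match: " + bad);
        }
    }
}
```

Looking at what ends up in the unmatched list should tell you whether a looser name pattern is worth the extra ambiguity.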

Sometimes you just have to accept that when the data you're given is crap, you just have to do the best that you can and that might mean throwing some of it away.

Upvotes: 2
