Jake Stanger
Jake Stanger

Reputation: 449

How to find consecutive repeats of unknown substring

I'm trying to make a system to factorise (if that term is correct in Chemistry) a given expanded chemical formula, such as C6H2NO2NO2NO2CH3 into brackets so that it is C₆H₂(NO₂)₃CH₃. (Ignore the subscript, or lack thereof in the first instance). Problem is, I don't know what the repeated molecule is going to be, or even how long it will be. How would I find and count repeats?

For context, here's my code so far which generates the formula from a 2D list of elements:

private String getFormula(List<List<Element>> elements)
{
    String formula = ""; //TODO Switch out for StringBuilder

    for(List<Element> currentElement : elements)
    {
        formula += currentElement.get(0).getSymbol(); //Every element per list is identical, so looking at 0 will always be safe
        if(currentElement.size() > 1) formula += currentElement.size(); //Only display a number if there is more than 1 element
    }

    return formula;
}

Upvotes: 5

Views: 807

Answers (5)

robotlos
robotlos

Reputation: 536

EDIT Updated so that it considers order. Thanks @JakeStanger!
2nd EDIT Updated to reflect new condition where molecule ends with |


I used a regular expression to split by | since from the given String we know that a new molecule starts after |. I used a Hashmap to keep track of how many molecules of each type were given. In the end I iterated through each value in the Hashmap and appended to String result depending on whether it was one molecule or not. Cheers!

   public static String factorise(String input) {
        String result = "";
        Map<String, Integer> molecules = new LinkedHashMap<>();
        String[] res = input.split("\\|");
        for (String t : res) {
            //Check if we already have this element in our map
            if (!molecules.containsKey(t)) {
                //If not then add it and set the count to 1
                molecules.put(t, 1);
            } else {
                //If we do then update the count by 1
                molecules.put(t, molecules.get(t) + 1);
            }
        }
        //Iterate through each molecule
        for (String key : molecules.keySet()) {
            if (molecules.get(key) == 1) {
                //If the count is only at one, then we just need to append it.
                result += key;
            } else {
                //Otherwise, we need the parentheces and the number of repetitions followed after
                result = result + "(" + key + ")" + molecules.get(key);
            }
        }
        return result;
    }

Running

    System.out.println(factorise("C6|H2|NO2|NO2|NO2|CH3|OH|OH"));
    System.out.println(factorise("HO|HO"));

Yields the following when run:

run:
C6H2(NO2)3CH3(OH)2
(HO)2
BUILD SUCCESSFUL (total time: 0 seconds)

Upvotes: 2

mellamokb
mellamokb

Reputation: 56769

Here is an alternative solution that uses simple string parsing to find matching repeated substrings.

public static String factorise(String input) {

    StringBuilder result = new StringBuilder();

    for (int start = 0; start < input.length(); start++) {
        char c = input.charAt(start);
        if (c >= '0' && c <= '9') {
            result.append(c);
            continue;
        }

        boolean foundRepeat = false;
        for (int end = start + 1; end <= input.length(); end++) {
            int length = end - start;
            if (end + length > input.length()) break;

            String sub = input.substring(start, end);
            String nextsub = input.substring(end, end + length);
            int nextpos = end + length;
            int count = 1;

            while (sub.equals(nextsub)) {
                count++;
                if (nextpos + length > input.length()) break;

                nextsub = input.substring(nextpos, nextpos + length);
                nextpos += length;
            }

            if (count > 1) {
                result.append("(" + sub + ")" + count);
                start += length * (count) - 1;
                foundRepeat = true;
                break;
            }
        }

        if (!foundRepeat) {
            result.append(c);
        }
    }

    return result.toString();
}

Examples:

  • Input: Output
  • C6H2NO2NO2NO2CH3: C6H2(NO2)3CH3
  • CaOHOH: Ca(OH)2
  • C6H2OH2ONO: C6(H2O)2NO
  • C6H2NO2NO2NO2CH3OHOH: C6H2(NO2)3CH3(OH)2

Upvotes: 1

Rajeev Singh
Rajeev Singh

Reputation: 3332

Here is a another solution using Map & ArrayList instead of regex. I inserted every molecule in a Map and incremented its key value every time it appeared in the formula and a ArrayList to maintain order of molecules.

public static void main(String [] audi){
        String s="C66H2NO2NO2NO2CH3";
        Map<String, Integer > Formula=new HashMap<String, Integer >();
        List E = new ArrayList();

        String t="";int c=0,l=s.length();
        for(int i=0;i<l;i++)
        {
            t=t.concat(""+s.charAt(i));
            if(Character.isDigit(s.charAt(i)))
            {
                if(((i+1)<l && !Character.isDigit(s.charAt(i+1)) )|| i==l-1)
                {
                    int count = Formula.containsKey(t) ? Formula.get(t) : 0;
                    Formula.put(t, count + 1);
                    if(!E.contains(t))
                    E.add(t);
                    t="";
                }
            }
        }
        //display
        for(int i=0;i<E.size();i++){
            if(Formula.get(E.get(i))>1)
                System.out.print("("+E.get(i) + ")"+Formula.get(E.get(i)));
            else
                System.out.print(E.get(i));
        }
}

Output: C6H2(NO2)3CH3

Upvotes: 0

Laurel
Laurel

Reputation: 6173

This answer gives the code: String result = source.replaceAll("(.+)\\1+", "$1"), which replaces all repeated substrings.

I'm thinking that a slight modification of that code should do what you want. If you use "($1)" as the replacement, it will wrap the match in parenthesis. You could probably step through the replacement and determine what number should come after the parenthesis.

To prevent the regex from capturing preceding numbers, try "([A-Za-z]+[1-9]*)\\1+".

This link explains how to count number of matches. It's a bit more complex:

 Pattern pattern = Pattern.compile("([A-Za-z]+[1-9]*)\\1+");
    Matcher  matcher = pattern.matcher(YOUR_CHEM_STRING);

   int count = 0;
   String prior="";
    while (matcher.find()){
       if(m.group().equals(prior){
             count++;
       }else{
           YOUR_CHEM_STRING.replaceAll("([A-Za-z]+[1-9]*)\\1+","($1)"+count);
           count=0;
       }
    }

Upvotes: 2

Matthew Wright
Matthew Wright

Reputation: 1525

You count split up the formula elements into a list and then parse it, counting consecutive repeats. When the count is greater than 1 you will add it with parenthesis.

String formula = "C6H2NO2NO2NO2CH3";
Pattern p = Pattern.compile("[a-zA-Z]+[0-9]+");
Matcher m = p.matcher(formula);
List<String> parts = new ArrayList<String>();
while(m.find()) {
    parts.add(m.group());
}

String shrink = "";
int count = 0;
for(int i=0; i<parts.size(); i++) {
    count++;
    if(i+1 == parts.size() || !parts.get(i+1).equals(parts.get(i))) {
        if(count == 1)
            shrink += parts.get(i);
        else
            shrink += "("+parts.get(i)+")"+count;
        count = 0;
    }
}
System.out.println(shrink); // result = "C6H2(NO2)3CH3"

If you are able to send the elements list, try this:

public static String shortForumla(List<List<Element>> elements) {
    String shrink = "";
    int count = 0;
    for(int i=0; i<elements.size(); i++) {
        String symbol = elements.get(i).get(0).symbol();
        if(i+1 == elements.size() || !elements.get(i+1).get(0).symbol().equals(symbol)) {
            if(count == 1)
                shrink += symbol;
            else
                shrink += "("+symbol+")"+count;
            count = 0;
        }
    }
    return shrink;
}

Upvotes: 1

Related Questions