Lisa Anna

Reputation: 87

Regex to match words but not numbers with certain characters

I'm trying to match company names and ignore measurements/quantities. But I'm having a bit of trouble.

Example data:

8G Kingston Single DDR3-1600 CL11 Desktop RAM (KVR16N11/8)
8 Outlet Belkin Surge Protector With 2 Meter Cord
0.5M Yellow CAT6 Network Cable
100" Intact 16x -R DVD
15.6"  Topload Notebook (Black)
120mm Aluminum Filter Silver
8P TP-Link 10/100 Desktop Switch
8Ware 0.5M CAT5E Network Cable
Acer Aspire Alpha 12" QHD IPS Display Intel Core i7 Touch Laptop
ACER Aspire E5 15.6" HD Intel Core i5 Laptop
Asus SDRW-08D2S-U Slim External USB 2.0 DVD Read/Writer - Black

I was hoping to match the company names but ignore leading tokens such as the gigabytes (8G), single digits, 100", 15.6", etc.

So ideally it'd match:

Kingston Single DDR3-1600 CL11 Desktop RAM (KVR16N11/8)
Outlet Belkin Surge Protector With 2 Meter Cord
Yellow CAT6 Network Cable
Intact 16x -R DVD
Topload Notebook (Black)
Aluminum Filter Silver
TP-Link 10/100 Desktop Switch
8Ware 0.5M CAT5E Network Cable
Acer Aspire Alpha 12" QHD IPS Display Intel Core i7 Touch Laptop
ACER Aspire E5 15.6" HD Intel Core i5 Laptop
Asus SDRW-08D2S-U Slim External USB 2.0 DVD Read/Writer - Black

The expression I've been tweaking is below, but it also matches the mm in the 120mm line, because I want 8Ware to match.

Upvotes: 0

Views: 142

Answers (1)

Pushpesh Kumar Rajwanshi

Reputation: 18357

Based on the data you provided, I came up with a regex you can use. Here is sample code you can run to see that it prints your desired results.

import java.util.ArrayList;
import java.util.List;

public class StripLeadingQuantity {

    public static void main(String[] args) {

        List<String> dataList = new ArrayList<>();
        dataList.add("8G Kingston Single DDR3-1600 CL11 Desktop RAM (KVR16N11/8)");
        dataList.add("8 Outlet Belkin Surge Protector With 2 Meter Cord");
        dataList.add("0.5M Yellow CAT6 Network Cable");
        dataList.add("100\" Intact 16x -R DVD");
        dataList.add("15.6\"  Topload Notebook (Black)");
        dataList.add("120mm Aluminum Filter Silver");
        dataList.add("8P TP-Link 10/100 Desktop Switch");
        dataList.add("8Ware 0.5M CAT5E Network Cable");
        dataList.add("Acer Aspire Alpha 12\" QHD IPS Display Intel Core i7 Touch Laptop");
        dataList.add("ACER Aspire E5 15.6\" HD Intel Core i5 Laptop");
        dataList.add("Asus SDRW-08D2S-U Slim External USB 2.0 DVD Read/Writer - Black");

        System.out.println("Before:");
        for (String s : dataList) {
            System.out.println(s);
        }
        System.out.println();
        System.out.println("After:");
        for (String s : dataList) {
            // Group 1 is the leading quantity token plus trailing whitespace;
            // group 2 is the rest of the line, which is all we keep.
            System.out.println(s.replaceAll("(^[0-9.]+[a-zA-Z\"]{0,2}\\s+)(.*)", "$2"));
        }
    }
}

Running this program produces the following output, which is exactly what you wanted.

Before:
8G Kingston Single DDR3-1600 CL11 Desktop RAM (KVR16N11/8)
8 Outlet Belkin Surge Protector With 2 Meter Cord
0.5M Yellow CAT6 Network Cable
100" Intact 16x -R DVD
15.6"  Topload Notebook (Black)
120mm Aluminum Filter Silver
8P TP-Link 10/100 Desktop Switch
8Ware 0.5M CAT5E Network Cable
Acer Aspire Alpha 12" QHD IPS Display Intel Core i7 Touch Laptop
ACER Aspire E5 15.6" HD Intel Core i5 Laptop
Asus SDRW-08D2S-U Slim External USB 2.0 DVD Read/Writer - Black

After:
Kingston Single DDR3-1600 CL11 Desktop RAM (KVR16N11/8)
Outlet Belkin Surge Protector With 2 Meter Cord
Yellow CAT6 Network Cable
Intact 16x -R DVD
Topload Notebook (Black)
Aluminum Filter Silver
TP-Link 10/100 Desktop Switch
8Ware 0.5M CAT5E Network Cable
Acer Aspire Alpha 12" QHD IPS Display Intel Core i7 Touch Laptop
ACER Aspire E5 15.6" HD Intel Core i5 Laptop
Asus SDRW-08D2S-U Slim External USB 2.0 DVD Read/Writer - Black

As noted above, this is a base regex built from the sample data. You may have to tweak it if your actual data contains more cases; otherwise you are good to go.
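To get a feel for where tweaking might be needed, here is a small sketch that runs the same regex against two extra rows. Both rows are hypothetical and are not from the question's data, and the class and method names are just for illustration. The first row starts with a brand name that looks like a quantity; the second starts with a unit longer than two letters.

```java
import java.util.Arrays;
import java.util.List;

class RegexSpotCheck {
    // Same regex as in the answer above.
    static final String REGEX = "(^[0-9.]+[a-zA-Z\"]{0,2}\\s+)(.*)";

    static String strip(String s) {
        return s.replaceAll(REGEX, "$2");
    }

    public static void main(String[] args) {
        // Hypothetical rows, made up for this check.
        List<String> extraRows = Arrays.asList(
                "3M Privacy Filter",    // brand name shaped like a quantity
                "1000mAh Power Bank");  // unit with three letters
        for (String row : extraRows) {
            System.out.println(row + " -> " + strip(row));
        }
        // prints:
        // 3M Privacy Filter -> Privacy Filter
        // 1000mAh Power Bank -> 1000mAh Power Bank
    }
}
```

So a leading brand like "3M" would be stripped, and a three-letter unit would be left in place. Whether either matters depends on your real data.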

EDIT1:

OK, as requested in the comments, I am editing the answer to include an explanation of the regex.

(^[0-9.]+[a-zA-Z\"]{0,2}\s+)(.*)

The regex has two parts. The first part, (^[0-9.]+[a-zA-Z\"]{0,2}\s+), matches the measurement/quantity data. The second part, (.*), matches whatever remains, i.e. the rest of the line. I will elaborate only on the first part, since the second is trivial.

(^[0-9.]+[a-zA-Z\"]{0,2}\s+)

^ --> Anchors the match at the start of the input, since the measurement data is at the beginning of the line.

[0-9.]+ --> Matches one or more digits in the measurement/quantity data, which may include a dot character.

[a-zA-Z\"]{0,2} --> Matches the unit, such as G, M, mm or ", which in the given data has a length of 0 to 2. (The backslash before " is only there because the regex lives inside a Java string literal.) For example, the "8 Outlet..." line has no unit, so I had to use {0,2}; otherwise {1,2} would have been enough. The upper limit of 2 also keeps "8Ware ..." from being treated as measurement data, which you didn't want.

\s+ --> Consumes the one or more whitespace characters that follow the measurement data.
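To see the effect of the upper limit, here is a minimal sketch that builds the prefix part of the regex with a configurable bound. The helper name stripWithBound and its maxUnitLength parameter are my own, purely for illustration.

```java
class QuantifierBoundDemo {
    // Builds the prefix regex with {0,maxUnitLength} and removes the first match.
    static String stripWithBound(String s, int maxUnitLength) {
        String regex = "^[0-9.]+[a-zA-Z\"]{0," + maxUnitLength + "}\\s+";
        return s.replaceFirst(regex, "");
    }

    public static void main(String[] args) {
        String ware = "8Ware 0.5M CAT5E Network Cable";
        // With a bound of 2, "8" + "Wa" is followed by "r", not whitespace, so nothing matches.
        System.out.println(stripWithBound(ware, 2)); // prints "8Ware 0.5M CAT5E Network Cable"
        // With a bound of 4, "8" + "Ware" + " " matches and the company name is lost.
        System.out.println(stripWithBound(ware, 4)); // prints "0.5M CAT5E Network Cable"
        // The two-letter unit "mm" still fits within a bound of 2.
        System.out.println(stripWithBound("120mm Aluminum Filter Silver", 2)); // prints "Aluminum Filter Silver"
    }
}
```

For this data, 2 is the smallest bound that still covers the longest unit (mm), which leaves the most room before a name like 8Ware would be caught.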

So the whole regex is matched and then replaced by $2, meaning that only the text captured by the second group, (.*), is kept.
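Since only the prefix is being thrown away, the same result can probably be had without the second group, by replacing the prefix with an empty string. Below is a sketch of that variant with a precompiled Pattern. The class and method names are hypothetical, and it assumes each input is a single line, as in the sample data.

```java
import java.util.regex.Pattern;

class LeadingQuantityStripper {
    // Prefix part of the regex only; compiled once and reused for every row.
    private static final Pattern LEADING_QUANTITY =
            Pattern.compile("^[0-9.]+[a-zA-Z\"]{0,2}\\s+");

    static String strip(String productName) {
        // Removes the leading quantity token if present; otherwise returns the input unchanged.
        return LEADING_QUANTITY.matcher(productName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(strip("8G Kingston Single DDR3-1600 CL11 Desktop RAM (KVR16N11/8)"));
        // prints "Kingston Single DDR3-1600 CL11 Desktop RAM (KVR16N11/8)"
        System.out.println(strip("15.6\"  Topload Notebook (Black)"));
        // prints "Topload Notebook (Black)"
        System.out.println(strip("8Ware 0.5M CAT5E Network Cable"));
        // prints "8Ware 0.5M CAT5E Network Cable"
    }
}
```

Compiling the pattern once avoids recompiling it for every row, which String.replaceAll does on each call.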

Hope that clarifies things. Let me know if you need further explanation of any part.

Upvotes: 1
