Extracting text from complex string in excel

Question

The attached image (link: https://i.sstatic.net/w0pEw.png) shows a range of cells (B1:B7) from a table I imported from the web. I need a formula that allows me to extract the names from each cell. In this case, my objective is to generate the following list of names, where each name is in its own cell: Erik Karlsson, P.K. Subban, John Tavares, Matthew Tkachuk, Steven Stamkos, Dustin Brown, Shea Weber.

I have been reading about left, right, and mid functions, but I'm confused by the irregular spacing and special characters (i.e. the box with question mark beside some names).

Can anyone help me extract the names? Thanks

Zack · Accepted Answer

Assuming that your cells follow the same format, you can use a variety of text functions to get the name.

This formula requires the following format:

Some initial text, followed by
2 new lines in Excel (each represented by CHAR(10))
The name, which consists of a first name, a space, then a last name
A second space on the same line as the name, followed by some additional text.

With this format, you can use the following formula (assuming your data is in an Excel table, with the column of initial data named Text):

=MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,SEARCH(" ",MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,LEN([@Text])),SEARCH(" ",MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,LEN([@Text])))+1)-1)

To come up with this formula, we take the following steps:

First, we figure out where the name starts. We know this occurs after the 2 new lines, so we use:

=SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1

The inner (occurring second) SEARCH finds the first new line, and the outer (occurring first) finds the 2nd new line.
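
The nested SEARCH can be checked outside Excel. The sketch below is Python rather than an Excel formula: `search` is a hypothetical helper that mimics SEARCH's 1-based positions for a plain substring, and the sample cell content is made up, since the real cells are only visible in the image.

```python
def search(find_text, within_text, start_num=1):
    """Rough stand-in for Excel's SEARCH: 1-based position of a plain substring.

    Assumption: wildcards and case-insensitivity are ignored, which is fine
    for line feeds and spaces.
    """
    pos = within_text.find(find_text, start_num - 1)
    if pos == -1:
        raise ValueError("not found (Excel would show #VALUE!)")
    return pos + 1


# Made-up cell content: some initial text, 2 line feeds, then the name.
text = "1\n\nErik Karlsson SJS - D"

first_newline = search("\n", text)                      # inner SEARCH
second_newline = search("\n", text, first_newline + 1)  # outer SEARCH
start_of_name = second_newline + 1                      # the trailing +1

print(first_newline, second_newline, start_of_name)  # 2 3 4
```

Starting the outer search one character past the first match is what makes it land on the 2nd new line instead of finding the first one again.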

Now that we have that value, we can use it to get the rest of the string (everything after the 2 new lines). Let's say that the previous formula is stored in a table column called Start of Name. The 2nd formula, which we'll store in a column called Rest of String, is then:

=MID([@Text],[@[Start of Name]],LEN([@Text]))

Note that we're passing the length of the entire text as the last argument, which is always more than we need. That's not an issue: when the last argument to MID runs past the end of the text, Excel simply returns everything up to the end.

Once we have the text from the start of the name onward, we need the position of the 2nd space, which marks where the name ends. To find it, we first find the first space and then search again from the character after it, just as we did for the 2 new lines. Subtracting 1 excludes the space itself, so the result is also the length of the name. The formula we need is:

=SEARCH(" ",[@[Rest of String]],SEARCH(" ",[@[Rest of String]])+1)-1

So now we know where the name starts (right after the 2 new lines) and how long it is (up to, but not including, the 2nd space). Assuming these numbers are stored in columns named Start of Name and To Second Space respectively, we can use the following formula to get the name:

=MID([@Text],[@[Start of Name]],[@[To Second Space]])

This is equivalent to the first formula; the only difference is that the first formula doesn't use any "helper columns".
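
To sanity-check the helper-column logic end to end, here is a minimal Python sketch of the same steps. The helpers `search` and `mid` are hypothetical stand-ins for Excel's SEARCH and MID (plain substrings only, 1-based positions), and the sample strings are invented to match the format described above, not copied from the real cells.

```python
def search(find_text, within_text, start_num=1):
    """Stand-in for SEARCH: 1-based position, plain substring only (assumption)."""
    pos = within_text.find(find_text, start_num - 1)
    if pos == -1:
        raise ValueError("not found (Excel would show #VALUE!)")
    return pos + 1


def mid(text, start_num, num_chars):
    """Stand-in for MID: like Excel, it stops at the end of the text."""
    return text[start_num - 1 : start_num - 1 + num_chars]


def extract_name(text):
    # Helper column "Start of Name": first character after the 2nd line feed.
    start_of_name = search("\n", text, search("\n", text) + 1) + 1
    # Helper column "Rest of String": everything from the name onward.
    rest_of_string = mid(text, start_of_name, len(text))
    # Helper column "To Second Space": characters before the 2nd space.
    first_space = search(" ", rest_of_string)
    to_second_space = search(" ", rest_of_string, first_space + 1) - 1
    return mid(text, start_of_name, to_second_space)


# Invented sample cells following the assumed format.
print(extract_name("1\n\nErik Karlsson SJS - D"))  # Erik Karlsson
print(extract_name("2\n\nP.K. Subban NSH - D"))    # P.K. Subban
```

Each intermediate variable corresponds to one helper column, so a wrong result can be traced to a single step.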

Of course, if any cell doesn't match this format, then you'll be out of luck. Using Excel formulas to parse text can be finicky and inflexible. For example, if someone has a middle name, or someone's initials are written with a space (e.g. P. K. Subban instead of P.K. Subban), or there is a Jr. or something similar, your job will be a lot harder.

Another alternative is to use regular expressions to get the data you want. I would recommend this thorough answer as a primer, although you will still have the same issues with name formats.
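
As a rough illustration of the regular-expression route, the sketch below uses Python's `re` module rather than regex support inside Excel. The pattern, the function name `extract_name_regex` and the sample strings are all assumptions chosen to mirror the format above; they are not taken from the linked answer.

```python
import re

# Assumed format: any text, a line feed, any text, a line feed,
# then two space-separated words followed by another space.
NAME_PATTERN = re.compile(r"[^\n]*\n[^\n]*\n([^ \n]+ [^ \n]+) ")


def extract_name_regex(text):
    """Return the captured name, or None when the cell does not fit the format."""
    match = NAME_PATTERN.match(text)
    return match.group(1) if match else None


print(extract_name_regex("1\n\nErik Karlsson SJS - D"))  # Erik Karlsson
# The name-format problem remains: spaced initials are cut short.
print(extract_name_regex("2\n\nP. K. Subban NSH - D"))   # P. K.
print(extract_name_regex("no line feeds here"))          # None
```

The second call shows that the pattern inherits the same "first name, space, last name" assumption as the formula.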

Finally, there's the obligatory Falsehoods Programmers Believe About Names as a warning against assuming any kind of standardized name format.
