Reputation: 148
I have a csv file from a database I've converted into a Pandas DataFrame that I'm trying to clean up. One of the issues is that multiple values have been input into single cells that need to be split up. The complicating factor is that there are string comments (also with commas) that need to be kept intact. The problem is illustrated in the example below, in Series form.
What I have:
Index | values
0 | 2.54,3.563
1 | bad design, right?
What I want:
Index | level_0 | values
0 | 0 | 2.54
1 | 0 | 3.563
2 | 1 | bad design, right?
As you can see, there are commas separating the values I want to split, with no following space after the comma, while the commas in string comments all have spaces after them. Seems like an easy thing to apply regex to split up. My solution below, using a strategy taken from another StackOverflow solution, is to use Series.str.split to separate the values into separate columns, then stack the columns. That strategy works great. However, in this case, the regex is apparently not identifying the split. Here's my code:
Import pandas as pd
# Example Series:
data = pd.Series(("2.54,3.56", "3.24,5.864", "bad design, right?"), name = "values")
# Split cells with multiple entries into separate rows
split_data = data.str.split('[,]\b').apply(pd.Series)
# Stack the results and pull out the index into a column (which is sample number in my case)
split_data = split_data.stack().reset_index(0)
split_data = split_data.reset_index(drop=True)
I'm new to regular expressions, but from the guides I've looked at and from using a couple regex sandboxes specific to Python, it seems like the regex [,]\b should split the values, but not the comments. However, it does not split anything with this regex.
Here's the result of the debugger, which says this should work: Debuggex Demo
Am I missing something easy here? Any better ideas on making this work? I'm using Python 3.5, if that makes a difference. Thanks.
Upvotes: 0
Views: 3009
Reputation: 3236
I would be inclined to use a lookahead; how you do so depends on your expected data.
This is a negative lookahead. it says "a comma that is not followed by whitespace" and would be preferred if you are sure that all comments with commas have whitespace, and would want to treat "red,green" as something to split.
data.str.split('[,](?!\s)').apply(pd.Series)
Another option is a positive lookahead for something that looks like a valid value; your example was numbers, so for instance this would split only on a comma that is followed by a number:
data.str.split('[,](?:\d)').apply(pd.Series)
Regular expressions are very powerful, but honestly, I am not sure that this solution will be great for you if this is a long-term problem. Getting most cases right as a one-time migration should be fine, but longer term I would consider trying to solve the problem before it gets here. Anyway, here's Debuggex's python regex cheat sheet, in case it is useful to you: https://www.debuggex.com/cheatsheet/regex/python
Upvotes: 1