Moax
Moax

Reputation: 50

Separate string column into multiple columns in Pandas

There is this one column that has two variables stored namely date and company name. My goal is to seperate these two variables into two columns.

date_time/full_company_name

My approach so far was:

df['date_time'] = [i[:10] for i in df['date_time/full_company_name']]
df['full_company_name'] = [i[10:] for i in df['date_time/full_company_name']]
df.drop('date_time/full_company_name', axis=1, inplace=True)

The code above has worked well, however there are a number of botched data entries in the data set such as:

I have thought about some possible solutions such as using a loop with a bunch of if statements to handle the exceptions or perhaps inserting some kind of delimiter into the string and then using string.split('_'). But these workarounds are fairly cumbersome.

I can't help but wonder if there is a more generic function or method available out there.

Upvotes: 2

Views: 130

Answers (1)

In general iterating over the contents of a pandas series or dataframe is bad for performance and instead vectorized approaches are recommended when possible:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided...

See the docs.

That said, for the "non-botched" entries you can do this:

df['date_time'] = df['date_time/full_company_name'].str.slice(0, 10)
df['full_company_name'] = df['date_time/full_company_name'].str.slice(10, None)
df.drop('date_time/full_company_name', axis=1, inplace=True)

For botched entries, if there is no consistent pattern, it's gonna be hard to come up with a programmatic approach, but something that works for your examples and could be extended with other known entry errors is using extract with a regular expression:

# The first capture group is either a Y-M-D date, the NaN string or a sequence of digits
# The second one is any string
pattern = r'(\d{4}-\d{2}-\d{2}|NaN|0|\d+)(.+)'

df['date_time/full_company_name'].str.extract(pattern)

Output:

            0                             1
0  2020-05-19                 Lopez-Wallace
1  2020-05-12                   Smith-Simon
2  2020-10-02                   Jenkins Inc
3  2020-07-06                   Moore-Weiss
4           0       Lopez, Barton and Jones
5         NaN  Brown, Singleton and Harrell
6       84635                   Ball-Thomas

Upvotes: 1

Related Questions