Reputation: 50
There is this one column that has two variables stored namely date and company name. My goal is to seperate these two variables into two columns.
date_time/full_company_name
My approach so far was:
df['date_time'] = [i[:10] for i in df['date_time/full_company_name']]
df['full_company_name'] = [i[10:] for i in df['date_time/full_company_name']]
df.drop('date_time/full_company_name', axis=1, inplace=True)
The code above has worked well, however there are a number of botched data entries in the data set such as:
I have thought about some possible solutions such as using a loop with a bunch of if statements to handle the exceptions or perhaps inserting some kind of delimiter into the string and then using string.split('_')
. But these workarounds are fairly cumbersome.
I can't help but wonder if there is a more generic function or method available out there.
Upvotes: 2
Views: 130
Reputation: 3001
In general iterating over the contents of a pandas series or dataframe is bad for performance and instead vectorized approaches are recommended when possible:
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided...
See the docs.
That said, for the "non-botched" entries you can do this:
df['date_time'] = df['date_time/full_company_name'].str.slice(0, 10)
df['full_company_name'] = df['date_time/full_company_name'].str.slice(10, None)
df.drop('date_time/full_company_name', axis=1, inplace=True)
For botched entries, if there is no consistent pattern, it's gonna be hard to come up with a programmatic approach, but something that works for your examples and could be extended with other known entry errors is using extract
with a regular expression:
# The first capture group is either a Y-M-D date, the NaN string or a sequence of digits
# The second one is any string
pattern = r'(\d{4}-\d{2}-\d{2}|NaN|0|\d+)(.+)'
df['date_time/full_company_name'].str.extract(pattern)
Output:
0 1
0 2020-05-19 Lopez-Wallace
1 2020-05-12 Smith-Simon
2 2020-10-02 Jenkins Inc
3 2020-07-06 Moore-Weiss
4 0 Lopez, Barton and Jones
5 NaN Brown, Singleton and Harrell
6 84635 Ball-Thomas
Upvotes: 1