mlabenski
mlabenski

Reputation: 33

splitting a pandas column by delimiter, two different sizes in the rows

I'm attempting to split up a column in a pandas dataframe into multiple by a space delimiter. I realized that some of the rows have a date field and therefore it would need extra columns compared to the ones without it. Heres an example of column values,

DA Firstname Lastname 09/30/2020 07:44 AM 9/23/2020 6:06:38 PM
JW Firstname Lastname 10/25/2020 11:06 AM None

The first row wouldn't fit a space delimiter because there are 8 spaces. The second row would work for my dataset because there would be 6 spaces. Is there anyway to combine the dates together as the delimiter?

["Inital" "Firstname" "lastname" "date/time1" "date/time2"] where the "date/time2" column could also include "None"

The code I tried using was,

dataset= pd.read_csv("newOutput6",encoding = "ISO-8859-1", delimiter="\t", names = ['Name','Date'], index=False)
tmpDF = pd.DataFrame(columns=['Initals','FName','LName','SignupTime','Waiver'])
tmpDF[['Initals','FName','LName','SignupTime','Waiver']] = dataset['Name'].str.split(' ', expand=True)

sample csv:

 ,Name,Date
0,MA FName LName 10/25/2020 09:40 PM None,"October 26, 2020,8:00AM Until 8:50AM "
1,JB FName LName 10/26/2020 07:19 AM None,"October 26, 2020,8:00AM Until 8:50AM "
2,TB FName LName 10/25/2020 09:03 PM None,"October 26, 2020,8:00AM Until 8:50AM "
3,MB FName LName 10/25/2020 09:40 PM None,"October 26, 2020,8:00AM Until 8:50AM "
4,NC FName LName 10/25/2020 10:17 PM None,"October 26, 2020,8:00AM Until 8:50AM "
5,AC FName LName 10/25/2020 09:23 PM None,"October 26, 2020,8:00AM Until 8:50AM "
6,NF FName LName 10/26/2020 07:56 AM None,"October 26, 2020,8:00AM Until 8:50AM "
7,BG FName LName 10/25/2020 10:41 PM None,"October 26, 2020,8:00AM Until 8:50AM "
8,GH FName LName 10/26/2020 07:39 AM None,"October 26, 2020,8:00AM Until 8:50AM "
9,EH FName LName 10/25/2020 10:06 PM None,"October 26, 2020,8:00AM Until 8:50AM "
10,DM FName LName 10/25/2020 11:42 PM None,"October 26, 2020,8:00AM Until 8:50AM "
11,JM FName LName 10/25/2020 09:24 PM None,"October 26, 2020,8:00AM Until 8:50AM "
12,TP FName LName 10/26/2020 12:32 AM None,"October 26, 2020,8:00AM Until 8:50AM "
13,DS FName LName 10/25/2020 11:12 PM None,"October 26, 2020,8:00AM Until 8:50AM "
14,KS FName LName 10/25/2020 07:46 PM None,"October 26, 2020,8:00AM Until 8:50AM "
15,JW FName LName 10/25/2020 11:06 AM None,"October 26, 2020,8:00AM Until 8:50AM "
16,DA FName LName 09/30/2020 07:44 AM 9/23/2020 6:06:38 PM,"October 26, 2020,9:00AM Until 9:50AM "

Index 16 is a row that doesnt follow the traditional format, and I suspect a regex is needed to determine that.

Upvotes: 1

Views: 206

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150745

IN the case there is no spaces in FirstName and LastName (else how you distinguish them):

pattern = ('^(?P<Initials>\w+)\s'
           + '(?P<FName>\w+)\s'
           + '(?P<LName>\w+)\s'
           + '(?P<SignupTime>\d+/\d+/\d+ \d+:\d+ \w+)\s'
           + '(?P<Waiver>.*)'
)

df['name'].str.extract(pattern)

Output:

  Initials      FName     LName           SignupTime                Waiver
0       DA  Firstname  Lastname  09/30/2020 07:44 AM  9/23/2020 6:06:38 PM
1       JW  Firstname  Lastname  10/25/2020 11:06 AM                  None

Update: For optional Initials, you can try this pattern:

pattern = ('^(?P<Initials>\w+\s)?'    # make initial optional
           + '(?P<FName>\w+)\s+'
           + '(?P<LName>\w+)\s+'
           + '(?P<SignupTime>\d+/\d+/\d+ \d+:\d+ \w+)\s'
           + '(?P<Waiver>.*)'
)

Note that, now if Initials exists, there will be a trailing space, which you can easily handle.

Upvotes: 2

Related Questions