Reputation: 2698
I'm working with the following DataFrame column containing Date |TimeStamp | Name | Message
as a string
59770 [08/10/18, 5:57:43 PM] Luke: Message
59771 [08/10/18, 5:57:48 PM] Luke: Message
59772 [08/10/18, 5:57:50 PM] Luke: Message
I use the following function to capture the Date.
def getdate(x):
res = re.search("\d\d/\d\d/\d\d",x)
and the following code to capture the rest of the data (TimeStamp | Name | Message) into columns:
df['Data'].str.extract(r'\s*(.{10})](.*):(.*)')
Is there a workaround to capture and extract all 4 entities together?
Please Advise
Upvotes: 2
Views: 699
Reputation: 3989
As an alternative you could use regex named groups
together with pandas extractall
.
import pandas as pd
import re
df = pd.DataFrame(
[" [08/10/18, 5:57:43 PM] Luke: Message",
" [08/10/18, 5:57:48 PM] Luke: Message",
" [08/10/18, 5:57:50 PM] Luke: Message"])
print(df)
regex = re.compile(\
r"(?P<date>\d{2}/\d{2}/\d{2}),\s*"
r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
r"(?P<name>.+?):\s*"
r"(?P<message>.+)$"
)
df_out = df[0].str.extractall(regex).droplevel(1)
print(df_out)
Output from df_out
date timestamp name message
0 08/10/18 5:57:43 PM Luke Message
1 08/10/18 5:57:48 PM Luke Message
2 08/10/18 5:57:50 PM Luke Message
Upvotes: 2
Reputation: 1081
I change the format of each line as follow in file "file.csv" :
08/10/18, 5:57:43 PM, Luke, Message
And then used from this code to read it as data frame :
import pandas as pd
df = pd.read_csv("file.csv")
print (df)
OutPut:
Date time name msg
0 08/10/18 5:57:43 PM Luke Message
Suppose your data is in file "file_data.txt" as follow format:
[08/10/18, 5:57:43 PM] Luke: Message
[08/10/18, 5:57:48 PM] Luke: Message
[08/10/18, 5:57:50 PM] Luke: Message
you can use from thease sed commands to convert data to csv :
sed -i "s/]/,/" file_data.txt
sed -i "s/\[//" file_data.txt
sed -i "s/:/,/" file_data.txt
Upvotes: 0