Reputation: 1350
I have a python list as below,
list_fs = ['drwxrwx--- - uname 0 2017-08-25 12:10 hdfs://filepath=2011-01-31 16%3A06%3A09.0',
'drwxrwx--- - uname 0 2017-08-29 14:12 hdfs://filepath=2011-02-28 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-29 14:20 hdfs://filepath=2011-03-31 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-29 14:32 hdfs://filepath=2011-04-30 10%3A00%3A00',
'drwxrwx--- - uname 0 2018-02-20 13:57 hdfs://filepath=2011-05-31 08%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-29 15:02 hdfs://filepath=2011-05-31 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-29 15:06 hdfs://filepath=2011-06-30 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-31 10:38 hdfs://filepath=2011-07-31 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-31 10:42 hdfs://filepath=2011-08-31 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-31 11:08 hdfs://filepath=2011-09-30 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-31 11:11 hdfs://filepath=2011-10-31 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-31 11:15 hdfs://filepath=2011-11-30 10%3A00%3A00',
'drwxrwx--- - uname 0 2017-08-31 11:16 hdfs://filepath=2011-12-31 10%3A00%3A00']
I need to extract the timestamp and filepath into a pandas dataframe. The timestamp column needs to be in timestamp datatype and As below.
What is the best way to do this?
Upvotes: 1
Views: 107
Reputation: 12417
I think a quite straight solution can be this:
df = pd.DataFrame({'list_fs': list_fs})
df['TimeStamp_Ordered'] = df.list_fs.str.split(' ').str[6] +' ' + df.list_fs.str.split(' ').str[7]
df['TimeStamp_Ordered'] = pd.to_datetime(df['TimeStamp_Ordered']
df['FilePath'] = df.list_fs.str.split(' ').str[8] + ' ' + df.list_fs.str.split(' ').str[9]
df = df.drop('list_fs', axis=1)
df = df.sort_values(by='TimeStamp_Ordered',ascending=True)
Output:
TimeStamp_Ordered FilePath
0 2017-08-25 12:10:00 hdfs://filepath=2011-01-31 16%3A06%3A09.0
1 2017-08-29 14:12:00 hdfs://filepath=2011-02-28 10%3A00%3A00
2 2017-08-29 14:20:00 hdfs://filepath=2011-03-31 10%3A00%3A00
3 2017-08-29 14:32:00 hdfs://filepath=2011-04-30 10%3A00%3A00
5 2017-08-29 15:02:00 hdfs://filepath=2011-05-31 10%3A00%3A00
6 2017-08-29 15:06:00 hdfs://filepath=2011-06-30 10%3A00%3A00
7 2017-08-31 10:38:00 hdfs://filepath=2011-07-31 10%3A00%3A00
8 2017-08-31 10:42:00 hdfs://filepath=2011-08-31 10%3A00%3A00
9 2017-08-31 11:08:00 hdfs://filepath=2011-09-30 10%3A00%3A00
10 2017-08-31 11:11:00 hdfs://filepath=2011-10-31 10%3A00%3A00
11 2017-08-31 11:15:00 hdfs://filepath=2011-11-30 10%3A00%3A00
12 2017-08-31 11:16:00 hdfs://filepath=2011-12-31 10%3A00%3A00
4 2018-02-20 13:57:00 hdfs://filepath=2011-05-31 08%3A00%3A00
Upvotes: 0
Reputation: 383
import pandas as pd
df = pd.DataFrame(list_fs)
df['Timestamp_ordered'] = [re.findall('\d+-\d+-\d+ \d+:\d+',i)[0] for i in list_fs]
df['FilePath'] = [re.findall('hdfs:.*', i)[0] for i in list_fs]
df = df[['Timestamp_ordered', 'FilePath']].sort_values('Timestamp_ordered')
Upvotes: 2
Reputation: 51335
I think using regex
is probably a good way to go (using str.findall
):
import pandas as pd
df = pd.DataFrame(list_fs)
df['Timestamp_ordered'] = df[0].str.findall('\d{4}-\d{2}-\d{2} \d{2}:\d{2}').apply(pd.Series)
df['FilePath'] = df[0].str.findall('(\w{4}://.*)').apply(pd.Series)
# Sort and get rid of old column
df = df[['Timestamp_ordered', 'FilePath']].sort_values('Timestamp_ordered')
Which returns:
>>> df
Timestamp_ordered FilePath
0 2017-08-25 12:10 hdfs://filepath=2011-01-31 16%3A06%3A09.0
1 2017-08-29 14:12 hdfs://filepath=2011-02-28 10%3A00%3A00
2 2017-08-29 14:20 hdfs://filepath=2011-03-31 10%3A00%3A00
3 2017-08-29 14:32 hdfs://filepath=2011-04-30 10%3A00%3A00
5 2017-08-29 15:02 hdfs://filepath=2011-05-31 10%3A00%3A00
6 2017-08-29 15:06 hdfs://filepath=2011-06-30 10%3A00%3A00
7 2017-08-31 10:38 hdfs://filepath=2011-07-31 10%3A00%3A00
8 2017-08-31 10:42 hdfs://filepath=2011-08-31 10%3A00%3A00
9 2017-08-31 11:08 hdfs://filepath=2011-09-30 10%3A00%3A00
10 2017-08-31 11:11 hdfs://filepath=2011-10-31 10%3A00%3A00
11 2017-08-31 11:15 hdfs://filepath=2011-11-30 10%3A00%3A00
12 2017-08-31 11:16 hdfs://filepath=2011-12-31 10%3A00%3A00
4 2018-02-20 13:57 hdfs://filepath=2011-05-31 08%3A00%3A00
Upvotes: 1