Tronald Dump
Tronald Dump

Reputation: 1350

Extract values from List to Pandas DF

I have a python list as below,

list_fs = ['drwxrwx---   - uname 0 2017-08-25 12:10 hdfs://filepath=2011-01-31 16%3A06%3A09.0',
 'drwxrwx---   - uname 0 2017-08-29 14:12 hdfs://filepath=2011-02-28 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-29 14:20 hdfs://filepath=2011-03-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-29 14:32 hdfs://filepath=2011-04-30 10%3A00%3A00',
 'drwxrwx---   - uname 0 2018-02-20 13:57 hdfs://filepath=2011-05-31 08%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-29 15:02 hdfs://filepath=2011-05-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-29 15:06 hdfs://filepath=2011-06-30 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 10:38 hdfs://filepath=2011-07-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 10:42 hdfs://filepath=2011-08-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 11:08 hdfs://filepath=2011-09-30 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 11:11 hdfs://filepath=2011-10-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 11:15 hdfs://filepath=2011-11-30 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 11:16 hdfs://filepath=2011-12-31 10%3A00%3A00']

I need to extract the timestamp and filepath into a pandas dataframe. The timestamp column needs to be in timestamp datatype and As below.

enter image description here

What is the best way to do this?

Upvotes: 1

Views: 107

Answers (3)

Joe
Joe

Reputation: 12417

I think a quite straight solution can be this:

df = pd.DataFrame({'list_fs': list_fs})
df['TimeStamp_Ordered'] = df.list_fs.str.split(' ').str[6]  +' ' + df.list_fs.str.split(' ').str[7]
df['TimeStamp_Ordered'] = pd.to_datetime(df['TimeStamp_Ordered']
df['FilePath'] = df.list_fs.str.split(' ').str[8] + ' ' + df.list_fs.str.split(' ').str[9]
df = df.drop('list_fs', axis=1)
df = df.sort_values(by='TimeStamp_Ordered',ascending=True)

Output:

     TimeStamp_Ordered                                   FilePath
0  2017-08-25 12:10:00  hdfs://filepath=2011-01-31 16%3A06%3A09.0
1  2017-08-29 14:12:00    hdfs://filepath=2011-02-28 10%3A00%3A00
2  2017-08-29 14:20:00    hdfs://filepath=2011-03-31 10%3A00%3A00
3  2017-08-29 14:32:00    hdfs://filepath=2011-04-30 10%3A00%3A00
5  2017-08-29 15:02:00    hdfs://filepath=2011-05-31 10%3A00%3A00
6  2017-08-29 15:06:00    hdfs://filepath=2011-06-30 10%3A00%3A00
7  2017-08-31 10:38:00    hdfs://filepath=2011-07-31 10%3A00%3A00
8  2017-08-31 10:42:00    hdfs://filepath=2011-08-31 10%3A00%3A00
9  2017-08-31 11:08:00    hdfs://filepath=2011-09-30 10%3A00%3A00
10 2017-08-31 11:11:00    hdfs://filepath=2011-10-31 10%3A00%3A00
11 2017-08-31 11:15:00    hdfs://filepath=2011-11-30 10%3A00%3A00
12 2017-08-31 11:16:00    hdfs://filepath=2011-12-31 10%3A00%3A00
4  2018-02-20 13:57:00    hdfs://filepath=2011-05-31 08%3A00%3A00

Upvotes: 0

iRhonin
iRhonin

Reputation: 383

import pandas as pd

df = pd.DataFrame(list_fs)

df['Timestamp_ordered'] = [re.findall('\d+-\d+-\d+ \d+:\d+',i)[0] for i in list_fs]
df['FilePath'] = [re.findall('hdfs:.*', i)[0] for i in list_fs]

df = df[['Timestamp_ordered', 'FilePath']].sort_values('Timestamp_ordered')

Upvotes: 2

sacuL
sacuL

Reputation: 51335

I think using regex is probably a good way to go (using str.findall):

import pandas as pd

df = pd.DataFrame(list_fs)

df['Timestamp_ordered'] = df[0].str.findall('\d{4}-\d{2}-\d{2} \d{2}:\d{2}').apply(pd.Series)
df['FilePath'] = df[0].str.findall('(\w{4}://.*)').apply(pd.Series)
# Sort and get rid of old column
df = df[['Timestamp_ordered', 'FilePath']].sort_values('Timestamp_ordered')

Which returns:

>>> df
   Timestamp_ordered                                   FilePath
0   2017-08-25 12:10  hdfs://filepath=2011-01-31 16%3A06%3A09.0
1   2017-08-29 14:12    hdfs://filepath=2011-02-28 10%3A00%3A00
2   2017-08-29 14:20    hdfs://filepath=2011-03-31 10%3A00%3A00
3   2017-08-29 14:32    hdfs://filepath=2011-04-30 10%3A00%3A00
5   2017-08-29 15:02    hdfs://filepath=2011-05-31 10%3A00%3A00
6   2017-08-29 15:06    hdfs://filepath=2011-06-30 10%3A00%3A00
7   2017-08-31 10:38    hdfs://filepath=2011-07-31 10%3A00%3A00
8   2017-08-31 10:42    hdfs://filepath=2011-08-31 10%3A00%3A00
9   2017-08-31 11:08    hdfs://filepath=2011-09-30 10%3A00%3A00
10  2017-08-31 11:11    hdfs://filepath=2011-10-31 10%3A00%3A00
11  2017-08-31 11:15    hdfs://filepath=2011-11-30 10%3A00%3A00
12  2017-08-31 11:16    hdfs://filepath=2011-12-31 10%3A00%3A00
4   2018-02-20 13:57    hdfs://filepath=2011-05-31 08%3A00%3A00

Upvotes: 1

Related Questions