Reputation: 85
I'm looking for help in splitting my data. My data has spaces as thousand separators but there's also spaces between my timestamps.
This is an example of what the data looks like (this is currently 1 column):
Date/Time Var1 Var2 Var3 Var4 Var5 Var6
17/04/2020 00:00:00 133 579.20 31 978.90 377 952.81 179 412.41 203 595.20 70 447.40
17/04/2020 01:00:00 133 583.70 31 980.40 377 963.41 179 412.41 203 595.20 70 448.20
I would need it to look like this:
Date/Time Var1 Var2 Var3 Var4 Var5 Var6
17/04/2020 00:00:00 133579.20 31978.90 377952.81 179412.41 203595.20 70447.40
17/04/2020 01:00:00 133583.70 31980.40 377963.41 179412.41 203595.20 70448.20
I'm trying to go around this an odd and over-complicated way I believe. In one instance, I removed all whitespaces and then to split the dates I did:
df.iloc[:,3] = df.iloc[:,3].str.replace('/2020', '/2020 ').str.replace(':00:00', ':00:00 ')
And for then splitting all the numbers, I was trying to do something such as reading every character in a string and once it found a dot, add a whitespace 2 strings ahead but I didn't manage to get that to work.
for i in range(len(df)):
for j in df.iloc[i,:]:
for k in j:
if k == '.':
#to be continued
Surely there's a much quicker way to get around this. Could anyone give me a hand?
Upvotes: 1
Views: 856
Reputation: 5502
You can try:
out = df[df.columns[0]].str.split(r"\s+")\
.apply(lambda x: pd.Series([" ".join(x[:2])] + ["{}{}".format(a, b) for a, b in zip(x[2::2], x[3::2])])) \
.rename(columns={old:new for old, new in enumerate(cols)})
Explanations:
df[df.columns[0]]
, split this column according all spaces using pandas.Series.str.split
. The regex is simple \s+
: df[df.columns[0]].str.split(r"\s+")
apply
.
[pd.Timestamp(" ".join(x[:2]))]
zip
. This discussion provides more details.[float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])]
Convert this list in pd.Series
wrapping result from step 2.1 and 2.2.
Rename the columns using rename
. A dict-comprehension let us perform the expected result: .rename(columns={old:new for old, new in enumerate(cols)})
Full code + illustration:
print(df)
# Date/Time Var1 Var2 Var3 Var4 Var5 Var6
# 0 17/04/2020 00:00:00 133 579.20 31 978.90 377 9...
# 1 17/04/2020 01:00:00 133 583.70 31 980.40 377 9...
# Step 1
print(df[df.columns[0]].str.split(r"\s+"))
# 0 [17/04/2020, 00:00:00, 133, 579.20, 31, 978.90...
# 1 [17/04/2020, 01:00:00, 133, 583.70, 31, 980.40...
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object
# Step 2.1
print(df[df.columns[0]].str.split(r"\s+")
.apply(lambda x: [pd.Timestamp(" ".join(x[:2]))]))
# 0 [2020-04-17 00:00:00]
# 1 [2020-04-17 01:00:00]
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object
# Step 2.2
print(df[df.columns[0]].str.split(r"\s+")
.apply(lambda x: [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])]))
# 0 [133579.2, 31978.9, 377952.81, 179412.41, 2035...
# 1 [133583.7, 31980.4, 377963.41, 179412.41, 2035...
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object
# Step 2.3
print(df[df.columns[0]].str.split(r"\s+")
.apply(lambda x: pd.Series([pd.Timestamp(" ".join(x[:2]))] + [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])])))
# 0 1 2 3 4 5 6
# 0 2020-04-17 00:00:00 133579.2 31978.9 377952.81 179412.41 203595.2 70447.4
# 1 2020-04-17 01:00:00 133583.7 31980.4 377963.41 179412.41 203595.2 70448.2
# Step 3
print(df.columns[0].split(" "))
# ['Date/Time', 'Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6']
out = df[df.columns[0]].str.split(r"\s+")\
.apply(lambda x: pd.Series([pd.Timestamp(" ".join(x[:2]))] + [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])])) \
.rename(columns={old: new for old, new in enumerate(df.columns[0].split(" "))})
print(out)
# Date/Time Var1 Var2 Var3 Var4 Var5 Var6
# 0 2020-04-17 00:00:00 133579.2 31978.9 377952.81 179412.41 203595.2 70447.4
# 1 2020-04-17 01:00:00 133583.7 31980.4 377963.41 179412.41 203595.2 70448.2
print(out.dtypes)
# Date/Time datetime64[ns]
# Var1 float64
# Var2 float64
# Var3 float64
# Var4 float64
# Var5 float64
# Var6 float64
# dtype: object
Upvotes: 1
Reputation: 12927
Assuming df
is your present dataframe and it has one column named simply 'D'
(if it is not 'D'
, change accordingly):
tmplist = df['D'].str.findall(r'(.+?[:.]\S+\s+)').to_list()
tmplist = [ [ e.replace(' ','') if i>0 else e.rstrip() for i, e in enumerate(row) ] for row in tmplist ]
col = ['Date/Time'] + [ 'Var{}'.format(i) for i in range(1,len(tmplist[0])) ]
df = pandas.DataFrame(tmplist, columns=col)
The first line converts the dataframe into a list of lists, splitting your strings as needed. A certain trick is used: a dot or colon followed by digits is the last of space-separated items that belong to one column (colon for timestamp, dot for floating-point numbers).
The second one removes all spaces from all columns except the first (timestamps), where it only removes trailing spaces.
Next one creates columns names according to your wish.
The last one rebuilds the dataframe from the list.
Upvotes: 2
Reputation: 994
string = """Date/Time Var1 Var2 Var3 Var4 Var5 Var6
17/04/2020 00:00:00 133 579.20 31 978.90 377 952.81 179 412.41 203 595.20 70 447.40
17/04/2020 01:00:00 133 583.70 31 980.40 377 963.41 179 412.41 203 595.20 70 448.20"""
head = string.split('\n')[0].split(' ')
values = []
import re
value_regex = re.compile(' \d\d\d\.\d*')
timestamp_regex = re.compile('\d\d:\d\d:\d\d')
for line in string.split('\n')[1:]:
for value in value_regex.findall(line):
line = line.replace(value, value.replace(' ','')+',')
for timestamp in timestamp_regex.findall(line):
line = line.replace(timestamp, timestamp+',')
value_cur_line = [sep[1:] if sep.startswith(' ') else sep for sep in line.split(',') if sep.replace(' ','')!='']
values.append(value_cur_line)
Upvotes: 0