Tom
Tom

Reputation: 85

Removing whitespaces from a string based on a condition

I'm looking for help in splitting my data. My data has spaces as thousand separators but there's also spaces between my timestamps.

This is an example of what the data looks like (this is currently 1 column):

Date/Time Var1 Var2 Var3 Var4 Var5 Var6
17/04/2020 00:00:00 133 579.20 31 978.90 377 952.81 179 412.41 203 595.20 70 447.40 
17/04/2020 01:00:00 133 583.70 31 980.40 377 963.41 179 412.41 203 595.20 70 448.20

I would need it to look like this:

Date/Time           Var1      Var2     Var3      Var4      Var5      Var6
17/04/2020 00:00:00 133579.20 31978.90 377952.81 179412.41 203595.20 70447.40 
17/04/2020 01:00:00 133583.70 31980.40 377963.41 179412.41 203595.20 70448.20

I'm trying to go around this an odd and over-complicated way I believe. In one instance, I removed all whitespaces and then to split the dates I did:

df.iloc[:,3] = df.iloc[:,3].str.replace('/2020', '/2020 ').str.replace(':00:00', ':00:00 ')

And for then splitting all the numbers, I was trying to do something such as reading every character in a string and once it found a dot, add a whitespace 2 strings ahead but I didn't manage to get that to work.

for i in range(len(df)):
    for j in df.iloc[i,:]:
        for k in j:
            if k == '.':
               #to be continued

Surely there's a much quicker way to get around this. Could anyone give me a hand?

Upvotes: 1

Views: 856

Answers (3)

Alexandre B.
Alexandre B.

Reputation: 5502

You can try:

out = df[df.columns[0]].str.split(r"\s+")\
    .apply(lambda x: pd.Series([" ".join(x[:2])] + ["{}{}".format(a, b) for a, b in zip(x[2::2], x[3::2])])) \
    .rename(columns={old:new for old, new in enumerate(cols)})

Explanations:

  1. After selecting the only column using df[df.columns[0]], split this column according all spaces using pandas.Series.str.split. The regex is simple \s+: df[df.columns[0]].str.split(r"\s+")
  2. Apply a custom function on each rows using apply.
    1. First recreate the date by merging the 2 first elements with a space, convert to a date and wrap it in a list: [pd.Timestamp(" ".join(x[:2]))]
    2. Merge all the others values 2-by2 using zip. This discussion provides more details.

[float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])]

  1. Convert this list in pd.Series wrapping result from step 2.1 and 2.2.

  2. Rename the columns using rename. A dict-comprehension let us perform the expected result: .rename(columns={old:new for old, new in enumerate(cols)})


Full code + illustration:

print(df)
#              Date/Time Var1 Var2 Var3 Var4 Var5 Var6
# 0  17/04/2020 00:00:00 133 579.20 31 978.90 377 9...
# 1  17/04/2020 01:00:00 133 583.70 31 980.40 377 9...

# Step 1
print(df[df.columns[0]].str.split(r"\s+"))
# 0    [17/04/2020, 00:00:00, 133, 579.20, 31, 978.90...
# 1    [17/04/2020, 01:00:00, 133, 583.70, 31, 980.40...
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object

# Step 2.1
print(df[df.columns[0]].str.split(r"\s+")
      .apply(lambda x: [pd.Timestamp(" ".join(x[:2]))]))
# 0    [2020-04-17 00:00:00]
# 1    [2020-04-17 01:00:00]
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object

# Step 2.2
print(df[df.columns[0]].str.split(r"\s+")
      .apply(lambda x: [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])]))
# 0    [133579.2, 31978.9, 377952.81, 179412.41, 2035...
# 1    [133583.7, 31980.4, 377963.41, 179412.41, 2035...
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object

# Step 2.3
print(df[df.columns[0]].str.split(r"\s+")
      .apply(lambda x: pd.Series([pd.Timestamp(" ".join(x[:2]))] + [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])])))
#                     0         1        2          3          4         5        6
# 0 2020-04-17 00:00:00  133579.2  31978.9  377952.81  179412.41  203595.2  70447.4
# 1 2020-04-17 01:00:00  133583.7  31980.4  377963.41  179412.41  203595.2  70448.2


# Step 3
print(df.columns[0].split(" "))
# ['Date/Time', 'Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6']


out = df[df.columns[0]].str.split(r"\s+")\
    .apply(lambda x: pd.Series([pd.Timestamp(" ".join(x[:2]))] + [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])])) \
    .rename(columns={old: new for old, new in enumerate(df.columns[0].split(" "))})
print(out)
#             Date/Time      Var1     Var2       Var3       Var4      Var5     Var6
# 0 2020-04-17 00:00:00  133579.2  31978.9  377952.81  179412.41  203595.2  70447.4
# 1 2020-04-17 01:00:00  133583.7  31980.4  377963.41  179412.41  203595.2  70448.2


print(out.dtypes)
# Date/Time    datetime64[ns]
# Var1                float64
# Var2                float64
# Var3                float64
# Var4                float64
# Var5                float64
# Var6                float64
# dtype: object

Upvotes: 1

Błotosmętek
Błotosmętek

Reputation: 12927

Assuming df is your present dataframe and it has one column named simply 'D' (if it is not 'D', change accordingly):

tmplist = df['D'].str.findall(r'(.+?[:.]\S+\s+)').to_list()
tmplist = [ [ e.replace(' ','') if i>0 else e.rstrip() for i, e in enumerate(row) ] for row in tmplist ]
col = ['Date/Time'] + [ 'Var{}'.format(i) for i in range(1,len(tmplist[0])) ]
df = pandas.DataFrame(tmplist, columns=col)

The first line converts the dataframe into a list of lists, splitting your strings as needed. A certain trick is used: a dot or colon followed by digits is the last of space-separated items that belong to one column (colon for timestamp, dot for floating-point numbers).

The second one removes all spaces from all columns except the first (timestamps), where it only removes trailing spaces.

Next one creates columns names according to your wish.

The last one rebuilds the dataframe from the list.

Upvotes: 2

lucky6qi
lucky6qi

Reputation: 994

string = """Date/Time Var1 Var2 Var3 Var4 Var5 Var6
17/04/2020 00:00:00 133 579.20 31 978.90 377 952.81 179 412.41 203 595.20 70 447.40 
17/04/2020 01:00:00 133 583.70 31 980.40 377 963.41 179 412.41 203 595.20 70 448.20"""

head = string.split('\n')[0].split(' ')
values = []
import re
value_regex = re.compile(' \d\d\d\.\d*')
timestamp_regex = re.compile('\d\d:\d\d:\d\d')
for line in string.split('\n')[1:]:
    for value in value_regex.findall(line):
        line = line.replace(value, value.replace(' ','')+',')
    for timestamp in timestamp_regex.findall(line):
        line = line.replace(timestamp, timestamp+',')
    value_cur_line =  [sep[1:] if sep.startswith(' ') else sep for sep in line.split(',') if sep.replace(' ','')!='']
    values.append(value_cur_line)

Upvotes: 0

Related Questions