Reputation: 49
I have a dataframe for well data that imports as one column and I have split like this:
df = pd.read_csv(r'df.txt', header=None)
df.columns = ['Test']
df = df.Test.str.split(expand=True)
0 1 2 3 4 5 6
0 well 1 20060201 3623.23 0.00 1300.00 None
1 well 1 20060202 3627.07 0.00 1305.00 None
2 well 1 20060203 3576.48 0.00 1305.00 None
... ... .. ... ... ... ... ...
42089 well14 20201114 0.00 0.00 0.00 None
42090 well14 20201115 0.00 0.00 0.00 None
... ... .. ... ... ... ... ...
51000 well 7 20201116 0.00 0.00 0.00 None
51001 well 7 20201117 0.00 0.00 0.00 None
As you can see, some of my data contains well names that have their number attached to the name. The code reads this as one entity and puts it in column 0. Wells that had a space between the name and number were sorted into columns 0 and 1. Because of the shift, an extra column is added (col 6).
I have tried using str.extract() and str.split() to separate the number from the well name and move it into column 1, so that I could then use str.cat() to join columns 0 and 1 together, but I can't get it to work. I need this to work for varying well names and numbers.
In the end, I need my data to look like this:
0 1 2 3 4
0 well 1 20060201 3623.23 0.00 1300.00
1 well 1 20060202 3627.07 0.00 1305.00
2 well 1 20060203 3576.48 0.00 1305.00
... ... .. ... ... ... ...
42089 well 14 20201114 0.00 0.00 0.00
42090 well 14 20201115 0.00 0.00 0.00
... ... .. ... ... ... ...
51000 well 7 20201116 0.00 0.00 0.00
51001 well 7 20201117 0.00 0.00 0.00
Would appreciate any help with this. Thank you!
Upvotes: 1
Views: 311
Reputation: 150735
Let's try:
# extract the names and digits
tmp = df[0].str.extract('^(.*\D)(\d+)?$')
# where the names are attached to digits
name_with_digits = tmp[1].notna()
# shift these values horizotally
df.loc[name_with_digits, 1:] = df.loc[name_with_digits, 1:].shift(axis=1)
# update the names
df.loc[name_with_digits,[0,1]] = tmp
# concatenate the names
df[0] = df[0] + ' ' + df[1].astype(str)
# drop unnecessary columns
df = df.drop([1,6], axis=1)
Output:
0 2 3 4 5
0 well 1 20060201.0 3623.23 0.0 1300.00
1 well 1 20060202.0 3627.07 0.0 1305.00
2 well 1 20060203.0 3576.48 0.0 1305.00
42089 well 14 20201114 0.0 0.0 0.0
42090 well 14 20201115 0.0 0.0 0.0
Upvotes: 1