kn2298
kn2298

Reputation: 49

Extract number from text string and move it into separate column in dataframe

I have a dataframe for well data that imports as one column and I have split like this:

df = pd.read_csv(r'df.txt', header=None)
df.columns = ['Test']
df = df.Test.str.split(expand=True)

         0          1           2        3      4       5         6
0      well         1       20060201  3623.23  0.00   1300.00    None
1      well         1       20060202  3627.07  0.00   1305.00    None
2      well         1       20060203  3576.48  0.00   1305.00    None
...        ... ..       ...      ...   ...      ...   ...
42089  well14    20201114     0.00     0.00    0.00    None
42090  well14    20201115     0.00     0.00    0.00    None
...        ... ..       ...      ...   ...      ...   ...
51000  well         7       20201116   0.00    0.00    0.00      None
51001  well         7       20201117   0.00    0.00    0.00      None

As you can see, some of my data contains well names that have their number attached to the name. The code reads this as one entity and puts it in column 0. Wells that had a space between the name and number were sorted into columns 0 and 1. Because of the shift, an extra column is added (col 6).

I have tried using str.extract() and str.split() to separate the number from the well name and move it into column 1, so that I could then use str.cat() to join columns 0 and 1 together, but I can't get it to work. I need this to work for varying well names and numbers.

In the end, I need my data to look like this:

          0          1           2       3      4                
0      well 1     20060201    3623.23  0.00   1300.00    
1      well 1     20060202    3627.07  0.00   1305.00   
2      well 1     20060203    3576.48  0.00   1305.00    
...        ... ..       ...      ...   ...      ...   
42089  well 14    20201114     0.00    0.00    0.00    
42090  well 14    20201115     0.00    0.00    0.00    
...        ... ..       ...      ...   ...      ...   
51000  well 7     20201116     0.00    0.00    0.00     
51001  well 7     20201117     0.00    0.00    0.00      

Would appreciate any help with this. Thank you!

Upvotes: 1

Views: 311

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150735

Let's try:

# extract the names and digits
tmp = df[0].str.extract('^(.*\D)(\d+)?$')

# where the names are attached to digits
name_with_digits = tmp[1].notna()

# shift these values horizotally
df.loc[name_with_digits, 1:] = df.loc[name_with_digits, 1:].shift(axis=1)

# update the names
df.loc[name_with_digits,[0,1]] = tmp

# concatenate the names
df[0] = df[0] + ' ' + df[1].astype(str)

# drop unnecessary columns
df = df.drop([1,6], axis=1)

Output:

             0           2        3    4        5
0       well 1  20060201.0  3623.23  0.0  1300.00
1       well 1  20060202.0  3627.07  0.0  1305.00
2       well 1  20060203.0  3576.48  0.0  1305.00
42089  well 14    20201114      0.0  0.0      0.0
42090  well 14    20201115      0.0  0.0      0.0

Upvotes: 1

Related Questions