Shmelky
Shmelky

Reputation: 83

How do I create a new column if the text from one column if the text from a second column contains a specific string pattern?

My current data looks something like this

+-------+----------------------------+-------------------+-----------------------+
| Index |             0              |         1         |           2           |
+-------+----------------------------+-------------------+-----------------------+
|     0 | Reference Curr             | Daybook / Voucher | Invoice Date Due Date |
|     1 | V50011 Tech Comp           | nan               | Phone:0177222222      |
|     2 | Regis Place                | nan               | Fax:017757575789      |
|     3 | Catenberry                 | nan               | nan                   |
|     4 | Manhattan, NY              | nan               | nan                   |
|     5 | V7484 Pipe                 | nan               | Phone:                   |
|     6 | Japan                      | nan               | nan                   |
|     7 | nan                        | nan               | nan                   |
|     8 | 4543.34GBP (British Pound) | nan               | nan                   |
+-------+----------------------------+-------------------+-----------------------+

I am trying to create a new column, df['Company'], that should contain the what is in df[0] if it Starts with a "V" and if df[2] has "Phone" in it. If the condition is not satisfied, then it can be nan. Below is what I am looking for.

+-------+----------------------------+-------------------+-----------------------+------------+
| Index |             0              |         1         |           2           | Company    |
+-------+----------------------------+-------------------+-----------------------+------------+
|     0 | Reference Curr             | Daybook / Voucher | Invoice Date Due Date | nan        |
|     1 | V50011 Tech                | nan               | Phone:0177222222      |V50011 Tech |
|     2 | Regis Place                | nan               | Fax:017757575789      | nan        |
|     3 | Catenberry                 | nan               | nan                   | nan        |
|     4 | Manhattan, NY              | nan               | nan                   | nan        |
|     5 | V7484 Pipe                 | nan               | Phone:                | V7484 Pipe |
|     6 | Japan                      | nan               | nan                   | nan        |
|     7 | nan                        | nan               | nan                   | nan        |
|     8 | 4543.34GBP (British Pound) | nan               | nan                   | nan        |
+-------+----------------------------+-------------------+-----------------------+------------+

I am trying the below script but I get an error ValueError: Wrong number of items passed 1420, placement implies 1

df['Company'] = pd.np.where(df[2].str.contains("Ph"), df[0].str.extract(r"(^V[A-Za-z0-9]+)"),"stop")

I put in "stop" as the else part because I don't know how to let python use nan when the condition is not met.

I would also like to be able to parse out a section of the df[0], for example just the v5001 section, but not rest of the cell contents. I tried something like this using AMCs answer but get an error:

df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0].str.extract(r"(^V[A-Za-z0-9]+)")

Thank you

Upvotes: 1

Views: 185

Answers (5)

Nev1111
Nev1111

Reputation: 1049

Here's another way to get your result

condition1=df['0'].str.startswith('V')
condition2=df['2'].str.contains('Phone')

df['Company']=np.where((condition1 & condition2), df['0'],np.nan)
df['Company']=df['Company'].str.split(' ',expand=True)

Upvotes: 1

Umar.H
Umar.H

Reputation: 23099

IIUC,

we can use either a boolean condition to extract the V Number with some basic regex,

or we can apply the same formula within a where statement.

to set a value to NaN we can use np.nan

if you want to grab the entire string after V we can use [V]\w+.* which will grab everything after the first match.

from IO import StringIO

 d = """+-------+----------------------------+-------------------+-----------------------+
| Index |             0              |         1         |           2           |
+-------+----------------------------+-------------------+-----------------------+
|     0 | Reference Curr             | Daybook / Voucher | Invoice Date Due Date |
|     1 | V50011 Tech Comp           | nan               | Phone:0177222222      |
|     2 | Regis Place                | nan               | Fax:017757575789      |
|     3 | Catenberry                 | nan               | nan                   |
|     4 | Manhattan, NY              | nan               | nan                   |
|     5 | Ultilagro, CT              | nan               | nan                   |
|     6 | Japan                      | nan               | nan                   |
|     7 | nan                        | nan               | nan                   |
|     8 | 4543.34GBP (British Pound) | nan               | nan                   |
+-------+----------------------------+-------------------+-----------------------+"""

df = pd.read_csv(StringIO(d),sep='|',skiprows=1)
df = df.iloc[1:-1,2:-1]
df.columns = df.columns.str.strip()

df["3"] = df[df["2"].str.contains("phone", case=False) == True]["0"].str.extract(
    r"([V]\w+)"
)

print(df[['0','2','3']])
                           0                      2       3
1              Reference Curr  Invoice Date Due Date     nan
2            V50011 Tech Comp       Phone:0177222222  V50011
3                 Regis Place       Fax:017757575789     nan
4                  Catenberry                    nan     nan
5               Manhattan, NY                    nan     nan
6               Ultilagro, CT                    nan     nan
7                       Japan                    nan     nan
8                         nan                    nan     nan
9  4543.34GBP (British Pound)                    nan     nan

if you want as a where statement:

import numpy as np



df["3"] = np.where(
    df[df["2"].str.contains("phone", case=False)], df["0"].str.extract(r"([V]\w+)"), np.nan
)
            print(df[['0','2','3']])
                                   0                      2       3
        1              Reference Curr  Invoice Date Due Date     NaN
        2            V50011 Tech Comp       Phone:0177222222  V50011
        3                 Regis Place       Fax:017757575789     NaN
        4                  Catenberry                    nan     NaN
        5               Manhattan, NY                    nan     NaN
        6               Ultilagro, CT                    nan     NaN
        7                       Japan                    nan     NaN
        8                         nan                    nan     NaN
        9  4543.34GBP (British Pound)                    nan     NaN

Upvotes: 0

LeoE
LeoE

Reputation: 2083

You can do it with the pandas apply function:

import re
import numpy as np
import pandas as pd
df['Company'] = df.apply(lambda x: x[0].split()[0] if re.match(r'^v[A-Za-z0-9]+', x[0].lower()) and 'phone' in x[1].lower() else np.nan, axis=1)

Edit:

To adjust to comment under @AMC's answer

Upvotes: 0

PatientOtter
PatientOtter

Reputation: 2298

A potential solution to this would be to use list comprehension. You could probably get a speed boost using some of pandas' built in functions but this will get you there.

#!/usr/bin/env python

import numpy as np
import pandas as pd

df = pd.DataFrame({
    0:["reference", "v5001 tech comp", "catenberry", "very different"],
    1:["not", "phone", "other", "text"]
    })

df["new_column"] = [x  if (x[0].lower() == "v") & ("phone" in y.lower())
  else np.nan for x,y in df.loc[:, [0,1]].values]

print(df)

Which will produce

                 0      1       new_column
0        reference    not              NaN
1  v5001 tech comp  phone  v5001 tech comp
2       catenberry  other              NaN
3   very different   text              NaN

All I'm doing is taking your two conditions and building a new list which will then be assigned to your new column.

Upvotes: 2

AMC
AMC

Reputation: 2702

You haven't provided an easy way for us to test potential solutions, but this should do the job:

df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0]

Upvotes: 2

Related Questions