Reputation: 83
My current data looks something like this
+-------+----------------------------+-------------------+-----------------------+
| Index | 0 | 1 | 2 |
+-------+----------------------------+-------------------+-----------------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date |
| 1 | V50011 Tech Comp | nan | Phone:0177222222 |
| 2 | Regis Place | nan | Fax:017757575789 |
| 3 | Catenberry | nan | nan |
| 4 | Manhattan, NY | nan | nan |
| 5 | V7484 Pipe | nan | Phone: |
| 6 | Japan | nan | nan |
| 7 | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan |
+-------+----------------------------+-------------------+-----------------------+
I am trying to create a new column, df['Company']
, that should contain the what is in df[0]
if it Starts with a "V" and if df[2]
has "Phone" in it. If the condition is not satisfied, then it can be nan
. Below is what I am looking for.
+-------+----------------------------+-------------------+-----------------------+------------+
| Index | 0 | 1 | 2 | Company |
+-------+----------------------------+-------------------+-----------------------+------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date | nan |
| 1 | V50011 Tech | nan | Phone:0177222222 |V50011 Tech |
| 2 | Regis Place | nan | Fax:017757575789 | nan |
| 3 | Catenberry | nan | nan | nan |
| 4 | Manhattan, NY | nan | nan | nan |
| 5 | V7484 Pipe | nan | Phone: | V7484 Pipe |
| 6 | Japan | nan | nan | nan |
| 7 | nan | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan | nan |
+-------+----------------------------+-------------------+-----------------------+------------+
I am trying the below script but I get an error ValueError: Wrong number of items passed 1420
, placement implies 1
df['Company'] = pd.np.where(df[2].str.contains("Ph"), df[0].str.extract(r"(^V[A-Za-z0-9]+)"),"stop")
I put in "stop" as the else part because I don't know how to let python use nan
when the condition is not met.
I would also like to be able to parse out a section of the df[0], for example just the v5001 section, but not rest of the cell contents. I tried something like this using AMCs answer but get an error:
df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0].str.extract(r"(^V[A-Za-z0-9]+)")
Thank you
Upvotes: 1
Views: 185
Reputation: 1049
Here's another way to get your result
condition1=df['0'].str.startswith('V')
condition2=df['2'].str.contains('Phone')
df['Company']=np.where((condition1 & condition2), df['0'],np.nan)
df['Company']=df['Company'].str.split(' ',expand=True)
Upvotes: 1
Reputation: 23099
IIUC,
we can use either a boolean condition to extract the V Number with some basic regex,
or we can apply the same formula within a where statement.
to set a value to NaN
we can use np.nan
if you want to grab the entire string after V we can use [V]\w+.*
which will grab everything after the first match.
from IO import StringIO
d = """+-------+----------------------------+-------------------+-----------------------+
| Index | 0 | 1 | 2 |
+-------+----------------------------+-------------------+-----------------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date |
| 1 | V50011 Tech Comp | nan | Phone:0177222222 |
| 2 | Regis Place | nan | Fax:017757575789 |
| 3 | Catenberry | nan | nan |
| 4 | Manhattan, NY | nan | nan |
| 5 | Ultilagro, CT | nan | nan |
| 6 | Japan | nan | nan |
| 7 | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan |
+-------+----------------------------+-------------------+-----------------------+"""
df = pd.read_csv(StringIO(d),sep='|',skiprows=1)
df = df.iloc[1:-1,2:-1]
df.columns = df.columns.str.strip()
df["3"] = df[df["2"].str.contains("phone", case=False) == True]["0"].str.extract(
r"([V]\w+)"
)
print(df[['0','2','3']])
0 2 3
1 Reference Curr Invoice Date Due Date nan
2 V50011 Tech Comp Phone:0177222222 V50011
3 Regis Place Fax:017757575789 nan
4 Catenberry nan nan
5 Manhattan, NY nan nan
6 Ultilagro, CT nan nan
7 Japan nan nan
8 nan nan nan
9 4543.34GBP (British Pound) nan nan
if you want as a where statement:
import numpy as np
df["3"] = np.where(
df[df["2"].str.contains("phone", case=False)], df["0"].str.extract(r"([V]\w+)"), np.nan
)
print(df[['0','2','3']])
0 2 3
1 Reference Curr Invoice Date Due Date NaN
2 V50011 Tech Comp Phone:0177222222 V50011
3 Regis Place Fax:017757575789 NaN
4 Catenberry nan NaN
5 Manhattan, NY nan NaN
6 Ultilagro, CT nan NaN
7 Japan nan NaN
8 nan nan NaN
9 4543.34GBP (British Pound) nan NaN
Upvotes: 0
Reputation: 2083
You can do it with the pandas apply
function:
import re
import numpy as np
import pandas as pd
df['Company'] = df.apply(lambda x: x[0].split()[0] if re.match(r'^v[A-Za-z0-9]+', x[0].lower()) and 'phone' in x[1].lower() else np.nan, axis=1)
To adjust to comment under @AMC's answer
Upvotes: 0
Reputation: 2298
A potential solution to this would be to use list comprehension. You could probably get a speed boost using some of pandas' built in functions but this will get you there.
#!/usr/bin/env python
import numpy as np
import pandas as pd
df = pd.DataFrame({
0:["reference", "v5001 tech comp", "catenberry", "very different"],
1:["not", "phone", "other", "text"]
})
df["new_column"] = [x if (x[0].lower() == "v") & ("phone" in y.lower())
else np.nan for x,y in df.loc[:, [0,1]].values]
print(df)
Which will produce
0 1 new_column
0 reference not NaN
1 v5001 tech comp phone v5001 tech comp
2 catenberry other NaN
3 very different text NaN
All I'm doing is taking your two conditions and building a new list which will then be assigned to your new column.
Upvotes: 2
Reputation: 2702
You haven't provided an easy way for us to test potential solutions, but this should do the job:
df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0]
Upvotes: 2