Reputation: 5
I have a bunch of CSV files with 4 line headers. In these files, I want to change the values in the sixth column based on the values in the second column. For example, if the second column, under the name PRODUCT
is Banana
, I would want to change the value in the same row under TIME
to 10m
. If the the product was Apple
I would want the time to be 15m
and so on.
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Desired Output
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 10m
2 Apple 400000 F4 15m
3 Pair 6m
4 Banana 4000 G3 10m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 15m
I want to output all of them to be outputed to a directory call NTime
. Here is what I have thus far, but being new to coding, I don't really understand a great deal and have gotten stuck on how to make the actual changes. I found Python/pandas idiom for if/then/else and it seems similar to what I want to do, but I don't completely understand what is going on.
import pandas as pd
import glob
import os
fns = glob.glob('*.csv')
colname1 = 'PRODUCT'
colname2 = 'TIME'
for csv in fns:
s = pd.read_csv(csv, usecols=[colname1], squeeze=True, skiprows=4, header=0)
with open(os.path.join('NTime', fn), 'wb') as f:
Can someone help me?
Upvotes: 0
Views: 235
Reputation: 25662
You can do this with a combination of groupby
, replace
and a dict
In [76]: from pandas import DataFrame
In [77]: fruits = ['banana', 'apple', 'pear', 'banana', 'watermelon', 'orange', 'apple']
In [78]: times = ['5m', '8m', '6m', '7m', '13m', '2m', '6m']
In [79]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m'}
In [80]: df = DataFrame({'fruits': fruits, 'time': times})
Out[80]:
fruits time
0 banana 5m
1 apple 8m
2 pear 6m
3 banana 7m
4 watermelon 13m
5 orange 2m
6 apple 6m
In [81]: def replacer(g, time_map):
....: tv = g.time.values
....: return g.replace(to_replace=tv, value=time_map.get(g.name, tv))
In [82]: df.groupby('fruits').apply(replacer, time_map)
Out[82]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 13m
5 orange 2m
6 apple 15m
You said you're new to programming so I'll explain what's going on.
df.groupby('fruits')
splits the DataFrame
into subsets (which are DataFrame
s or Series
objects) using the values of the fruits
column.
The apply
method applies a function to each of the aforementioned subsets and concatenates the result (if needed).
replacer
is where the "magic" happens: each group's time
values get replaced (to_replace
) with the new value that's defined in time_map
. The get
method of dict
s allows you to provide a default value if the key you're searching for (the fruit name in this case) is not there. nan
is commonly used for this purpose, but here I'm actually just using the time that was already there if there isn't a new one defined for it in the time_map
dict
.
One thing to note is my use of g.name
. This doesn't normally exist as an attribute on DataFrame
s (you can of course define it yourself if you want to), but is there so you can perform computations that may require the group name. In this case that's the "current" fruit you're looking at when you apply your function.
If you have a new value for each fruit or you write in the old values manually you can shorten this to a one-liner:
In [130]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m', 'orange': '10m', 'watermelon': '100m'
}
In [131]: s = Series(time_map, name='time')
In [132]: s[df.fruits]
Out[132]:
fruits
banana 10m
apple 15m
pear 5m
banana 10m
watermelon 100m
orange 10m
apple 15m
Name: time, dtype: object
In [133]: s[df.fruits].reset_index()
Out[133]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 100m
5 orange 10m
6 apple 15m
Upvotes: 1
Reputation: 789
Assuming that your data is in a Pandas DataFrame and looks something like this:
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 10m
2 Apple 400000 F4 15m
3 Pair 6m
4 Banana 4000 G3 10m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 15m
Then you should be able to do manipulate values in one column based on values in another column (same row) using simple loops like this:
for numi, i in enumerate(df["PRODUCT"]):
if i == "Banana":
df["TIME"][numi] = "10m"
if i == "Apple":
df["TIME"][numi] = "15m"
The code first loops through the rows of the dataframe column "PRODUCT", with the row value stored as i and the row-number stored as numi. It then uses if statements to identify the different levels of interest in the Product column. For those rows with the levels of interest (eg "Banana" or "Apple"), it uses the row-numbers to change the value of another column in the same row.
There are lots of ways to do this, and depending on the size of your data and the number of levels (in this case "Products") you want to change, this isn't necessarily the most efficient way to do this. But since you're a beginner, this will probably be a good basic way of doing it for you to start with.
Upvotes: 0