Reputation: 105
I’m currently doing some big data work. I have an issue in a .CSV where I need to split a multiple-line single-celled chunk of text, into individual cells. The below table shows the desired output. Currently, all of the 'ingredients' are in the same cell, with each ingredient on its own new line (Stack Overflow wouldn't allow me to create new lines in the same cell).
I need to write a script to split this single cell of ingredients into the below output, using each new line in the cell as a delimiter. The real use case I'm using this for is much more complex - over 200 'items', and anywhere between 50-150 'ingredients' per 'item'. I'm currently doing this manually in excel with a series of text to columns & transpose pastes, but it takes approximately 2-2.5 full work days to do.
Link to data
Code below
Item | Ingredients |
---|---|
Coffee | Coffee beans |
Milk | |
Sugar | |
Water |
import pandas as pd
df = pd.read_csv(r'd:\Python\menu.csv', delimiter=';', header=None)
headers = ["Item", "Ingredients"]
df.columns = headers
df["Ingredients"]=df["Ingredients"].str.split("\n")
df = df.explode("Ingredients").reset_index(drop=True)
df.to_csv(r"D:\Python\output.csv")
Upvotes: 1
Views: 1152
Reputation: 11188
Here's how to do it with Python's standard csv^1 ^2 module:
import csv
writer = csv.writer(open('output.csv', 'w', newline=''))
reader = csv.reader(open('input.csv', newline=''))
writer.writerow(next(reader)) # copy header
for row in reader:
item = row[0]
ingredients = row[1].split('\n')
first_ingredient = ingredients[0]
writer.writerow([item, first_ingredient])
for ingredient in ingredients[1:]:
writer.writerow([None, ingredient]) # None for a blank cell (under the item)
Given your small sample, I get this:
Item | Ingredients |
---|---|
Coffee | Coffee beans |
Milk | |
Sugar | |
Water |
Upvotes: 1
Reputation: 5024
Using your code and linked data change delimeter to a comma like below.
import pandas as pd
df = pd.read_csv('Inventory.csv', delimiter=',')
df["Software"]=df["Software"].str.split("\n")
df = df.explode("Software").reset_index(drop=True)
# Remove rows having empty string under Software column.
df = df[df['Software'].astype(bool)]
df = df.reset_index(drop=True)
df.to_csv("out_Inventory.csv")
print(df.to_string())
Hostname Software
0 ServerName1 Windows Driver Package - Amazon Inc. (AWSNVMe) SCSIAdapter (08/27/2019 1.3.2.53) [version 08/27/2019 1.3.2.53]
1 ServerName1 Airlock Digital Client [version 4.7.1.0]
2 ServerName1 AppFabric 1.1 for Windows Server [version 1.1.2106.32]
3 ServerName1 BlueStripe Collector [version 8.0.3]
...
Upvotes: 2