Reputation: 121
This is a continuation of a previous post. There, I got help creating a new column in a Pandas dataframe in which each value is a factorized (unique) label derived from another column's value. It works on a test case, but I am having trouble applying the same process to a much larger log and htm file. I have 12 log files (one per month), and combining them gives a 17GB file. I want to factorize every username in it. I have been looking into Dask, but I can't replicate the sort and factorize functionality for a Dask dataframe. Would it be better to try to use Dask, continue with Pandas or try with a MySQL database to manipulate a 17GB file?
import pandas as pd
import numpy as np
#import dask.dataframe as pf
df = pd.read_csv('example2.csv', header=0, dtype='unicode')
df_count = df['fruit'].value_counts()
df.sort_values(['fruit'], ascending=True, inplace=True)
# the rows are now sorted by the fruit column
df.reset_index(drop=True, inplace=True)
f, u = pd.factorize(df.fruit.values)
n = np.char.add('Fruit', f.astype(str))  # np.char.add is the public alias of np.core.defchararray.add
df = df.assign(NewCol=n)
#print(df)
df.to_csv('output.csv')
Upvotes: 1
Views: 560
Reputation: 57251
Would it be better to try to use Dask, continue with Pandas or try with a MySQL database to manipulate a 17GB file?
The answer to this question depends on a great many things and is probably too general to get a good answer on Stack Overflow.
However, there are a few particular questions you bring up that are easier to answer:
How do I factorize a column?
The easy way here is to categorize a column:
df = df.categorize(columns=['fruit'])
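To show how a categorical column yields the labels from the question, here is a minimal sketch in plain pandas on a tiny made-up frame. It assumes the integer codes behind the categories are what you want as the factorized value. In pandas the categories of a string column are sorted, so the codes match a sort followed by a factorize. Whether Dask's `categorize` orders categories the same way is an assumption you should check on your data.

```python
import pandas as pd

# Tiny made-up frame standing in for the real log data.
df = pd.DataFrame({"fruit": ["pear", "apple", "pear", "banana", "apple"]})

# Converting to a categorical stores each unique value once plus an integer code per row.
fruit_cat = df["fruit"].astype("category")

# For strings, pandas sorts the categories, so codes follow sorted order.
df["NewCol"] = "Fruit" + fruit_cat.cat.codes.astype(str)

print(df)
```

Unlike the code in the question, this does not reorder the rows; only the labels follow the sorted order of the unique values.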
How do I sort unique values within a column?
You can always set the column as the index, which will cause a sort. However, beware that sorting in a distributed setting can be quite expensive.
If, however, the column has only a small number of distinct values, you might find the unique values, sort those in memory, and then join them back onto the dataframe. Something like the following might work:
unique_fruit = df.fruit.drop_duplicates().compute() # this is now a pandas series
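unique_fruit = unique_fruit.sort_values().reset_index(drop=True) # index is now the sorted rank 0..n-1
numbers = pd.Series(unique_fruit.index, index=unique_fruit.values, name='fruit_code') # 'fruit_code' is an illustrative name; it avoids clashing with the 'fruit' column in the merge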
df = df.merge(numbers.to_frame(), left_on='fruit', right_index=True)
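If you would rather stay with Pandas, the same idea (collect the unique values, sort them in memory, map them back) can be sketched with chunked reads so the whole file never has to fit in memory. This is only a sketch under assumptions: the in-memory CSV text, the chunk size and the `labels` dict are made up for illustration, and a real run would pass file paths instead of `io.StringIO` objects.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large CSV; a real run would use a file path.
csv_text = "fruit,qty\npear,1\napple,2\npear,3\nbanana,4\napple,5\n"

# Pass 1: collect the unique values chunk by chunk.
unique = set()
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=str, chunksize=2):
    unique.update(chunk["fruit"].dropna().unique())

# Sort the unique values in memory and give each one a label.
labels = {name: f"Fruit{i}" for i, name in enumerate(sorted(unique))}

# Pass 2: map every chunk to its labels and append it to the output.
out = io.StringIO()
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), dtype=str, chunksize=2)):
    chunk["NewCol"] = chunk["fruit"].map(labels)
    chunk.to_csv(out, index=False, header=(i == 0))

result = pd.read_csv(io.StringIO(out.getvalue()))
print(result)
```

This keeps the rows in their original order. It only works when the set of unique values itself fits in memory, which is the same assumption the merge approach above makes.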
Upvotes: 1