Reputation: 2881
I am trying to download a file from google storage bucket and parse them. There are millions of such file, that needs to be downloaded, parsed and do some operations(Natural language processing etc) on them.
I am trying below code using dask's parallel processing and it is working but it is calling extract_skill
twice instead of once for each row in panda's dataframe. Please help me understand why extract_skill
method is being called twice.
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd
# downloading file and extract skill sets and store in skill_sets column
chunk_size = 20
df_list = np.array_split(temp_df, temp_df.shape[0]/chunk_size)
temp_df["skill_sets"] = ""
result_df = pd.DataFrame(data={}, columns=temp_df.columns)
for df_ in df_list:
df_["skill_sets"] = dd.from_pandas(df_, npartitions=4, sort=False, name='x').apply(extract_skill, axis=1, meta='object').compute()
result_df = pd.concat([result_df, df_], axis=0)
extract_skill()
def extract_skill(row):
// download file, parse and do some nlp stuff
file_name = row['file_path']
......
......
return skill_sets
Thanks in advance.
Upvotes: 1
Views: 149
Reputation: 57251
The DataFrame.apply
method runs your function on a small sample of data in order to determine the datatypes and columns of the output. See the docstring of this function and look for the keyword "meta" for more information.
Upvotes: 2