Robin

Reputation: 63

String length from DataFrame Column

I have a gzipped .csv file that I want to extract some data from. Specifically, I want to compute the average character length of thread titles and comment bodies in the file (it is a dump of a huge Reddit forum) and plot the results in another chart.

This is the function in Python:

def average_data_threads_and_comments():
    # Add character-count columns to the DataFrame
    df['thread_title_character_count'] = df['thread_title'].str.len()
    df['comment_body_character_count'] = df['comment_body'].str.len()

    # Calculation: mean per group, then mean over all groups (a single number each)
    df_thread_mean_character_count = df.groupby('thread_id')['thread_title_character_count'].mean().mean()
    df_thread_mean_score = df.groupby('thread_id')['thread_score'].mean().mean()
    df_comment_mean_character_count = df.groupby('comment_id')['comment_body_character_count'].mean().mean()
    df_comment_mean_score = df.groupby('comment_id')['comment_score'].mean().mean()

    # Create DataFrame for the output
    df_average_characters = pd.DataFrame({'thread_title_character_count': [df_thread_mean_character_count],
                                          'comment_body_character_count': [df_comment_mean_character_count]})
    # Create plot and save it
    df_average_characters.plot(kind='bar')
    plt.xlabel('Type')
    plt.ylabel('Average Characters: ')
    plt.savefig(output_folder+'5_average_characters_python.png', bbox_inches="tight")
    plt.close()

    # Create DataFrame for the output
    df_average_score = pd.DataFrame({'thread_score': [df_thread_mean_score],
                                     'comment_score': [df_comment_mean_score]})
    # Create plot and save it
    df_average_score.plot(kind='bar')
    plt.xlabel('Type')
    plt.ylabel('Average Score')
    plt.savefig(output_folder+'6_average_score_python.png', bbox_inches="tight")
    plt.close()

This is what I have so far in Julia:

function average_data_threads_and_comments()
   df.threadtitlecharactercount = length(df.thread_title)
   df.commentcharactercount = length(df.comment_body.length)

   thread_mean_character_count = mean(combine(groupby(df, :threadtitlecharactercount),
    :thread_id => length ∘ unique => :thread_mean_count))
   thread_mean_score_count = mean(combine(groupby(df, :thread_score),
    :thread_id => length ∘ unique => :thread_mean_score))
   comment_mean_character_count = mean(combine(groupby(df, :commentcharactercount),
    :comment_id => length ∘ unique => :comment_mean_count))
   comment_mean_score_count = mean(combine(groupby(df, :comment_score),
    :comment_id => length ∘ unique => :comment_mean_score))

   bar3 = bar(thread_mean_character_count.threadtitlecharactercount,
    comment_mean_character_count.commentcharactercount)
   xlabel!("Type")
   ylabel!("Average character length")
   title!("Average number of characters per thread and comment")
   png(bar3,output_folder*"5_durchschnittliche_anzahl_zeichen_julia.png")

   bar4 = bar(thread_mean_score_count.thread_score, comment_mean_score_count.comment_score)
   xlabel!("Type")
   ylabel!("Average score")
   title!("Average score per thread and comment")
   png(bar4,output_folder*"6_durchschnittliche_anzahl_punkte_julia.png")
end

Can someone help?

EDIT:

Sadly, I still have issues with my code. This is what I have now, after a lot of help:

function average_data_threads_and_comments()
    df.thread_title_character_count = passmissing(length).(df.thread_title)
    df.comment_character_count = passmissing(length).(df.comment_body)

    df_thread_character_count = combine(groupby(df, :thread_id),
        [:thread_title_character_count] .=> first ∘ skipmissing .=> [:df_thread_mean_character_count])
    df_thread_score = combine(groupby(df, :thread_score),
        [:thread_score] .=> first ∘ skipmissing .=> [:df_thread_score_mean])
    df_comment_character_count = combine(groupby(df, :comment_id),
        [:comment_character_count] .=> mean ∘ skipmissing .=> [:df_comment_character_count_mean])
    df_comment_score = combine(groupby(df, :comment_score),
        [:comment_score] .=> mean ∘ skipmissing .=> [:df_comment_score_mean])

    bar3 = bar(df_thread_character_count.df_thread_mean_character_count, df_comment_character_count.df_comment_character_count_mean)
    xlabel!("Type")
    ylabel!("Average Character length")
    title!("Average Number of Characters per Thread and Comment")
    png(bar3,output_folder*"5_average_character_length_julia.png")

    bar4 = bar(df_thread_score.df_thread_score_mean, df_comment_score.df_comment_score_mean)
    xlabel!("Type")
    ylabel!("Average Score")
    title!("Average Score for Comments and Threads")
    png(bar4,output_folder*"6_average_score_julia.png")
end

I get this error message in my REPL: "bar recipe: x must be same length as y (centers), or one more than y (edges)." The seemingly identical code runs fine in Python, though. Can someone spot the mistake?

Upvotes: 2

Views: 937

Answers (2)

Sundar R

Reputation: 14705

df.threadtitlecharactercount = length(df.thread_title)

You want the length of each string in the thread_title column, but length(df.thread_title) takes the length of the column itself i.e. how many rows of data it has. To apply length to each element of the column, use length.(df.thread_title) instead. If there's missing data in the column, you'll need passmissing(length).(df.thread_title) instead.
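For comparison with the Python version in the question, pandas makes the same distinction: `len(series)` counts the rows of the column, while `.str.len()` measures each string. A minimal sketch on made-up data (the titles below are invented for illustration, not taken from the CSV):

```python
import pandas as pd

# Toy stand-in for the thread_title column; the values are invented.
titles = pd.Series(["Hello", "Hi", None], name="thread_title")

# Length of the column itself: how many rows it has.
rows_in_column = len(titles)

# Length of each string in the column; a missing title stays missing.
per_title_length = titles.str.len()

print(rows_in_column)
print(per_title_length.tolist())
```

The first value corresponds to what `length(df.thread_title)` computes in Julia, the second to `passmissing(length).(df.thread_title)`.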

length(df.comment_body.length)

This should probably also be just passmissing(length).(df.comment_body).

thread_mean_character_count = mean(combine(groupby(df, :threadtitlecharactercount), :thread_id => length ∘ unique => :thread_mean_count))

It seems you want to group by thread_id first. Just like you do df.groupby('thread_id') in Python, it should be groupby(df, :thread_id) here.

However, at this point I'm a bit confused by the logic of the code, in both Python and Julia. One would assume that each thread_id has a single thread_title, so if a given thread id appears ten times, the corresponding thread_title_character_count is the same in all ten rows. Can you verify, right after the data frame is loaded from the CSV, that a fixed value in thread_id always comes with the same value in thread_title?
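Since you already have a working pandas setup, one possible way to run that check is to count the distinct titles per thread id. This is only a sketch on invented data; in practice `df` would be the frame loaded from your CSV, and the column names are the ones from the question:

```python
import pandas as pd

# Toy data, invented for illustration; thread "t2" deliberately has two different titles.
df = pd.DataFrame({
    "thread_id": ["t1", "t1", "t2", "t2", "t3"],
    "thread_title": ["First", "First", "Second", "Second (edited)", "Third"],
})

# Number of distinct titles for each thread_id.
titles_per_thread = df.groupby("thread_id")["thread_title"].nunique()

# Thread ids that come with more than one title.
inconsistent = titles_per_thread[titles_per_thread > 1]
print(inconsistent.index.tolist())
```

If `inconsistent` is empty on the real data, taking the first non-missing title length per thread is safe.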

If that is so, here you want:

using Statistics: mean

counts_and_scores = combine(groupby(df, :thread_id), 
                      [:thread_title_character_count, :thread_score] .=> first ∘ skipmissing .=> identity, 
                      [:comment_character_count, :comment_score] .=> mean ∘ skipmissing .=> [:comment_character_count_mean, :comment_score_mean])

After this, you can plot the things you want like
bar(counts_and_scores.thread_title_character_count, counts_and_scores.comment_character_count_mean)
and
bar(counts_and_scores.thread_score, counts_and_scores.comment_score_mean).
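One thing to keep in mind when comparing with the Python version: there, each quantity is reduced to a single number by `.mean().mean()`, so the chart has exactly two bars. The per-group results, on the other hand, have one entry per group, and grouping by different keys gives results of different lengths. A minimal pandas sketch on invented data showing the difference:

```python
import pandas as pd

# Toy data, invented for illustration: one row per comment.
df = pd.DataFrame({
    "thread_id": ["t1", "t1", "t2"],
    "comment_id": ["c1", "c2", "c3"],
    "thread_title_character_count": [10, 10, 20],
    "comment_body_character_count": [100, 50, 30],
})

# One value per group: the two results have different lengths (2 threads, 3 comments).
per_thread = df.groupby("thread_id")["thread_title_character_count"].mean()
per_comment = df.groupby("comment_id")["comment_body_character_count"].mean()
print(len(per_thread), len(per_comment))

# The second .mean() collapses each result to one number, which is what the Python plot uses.
thread_avg = per_thread.mean()
comment_avg = per_comment.mean()
print(thread_avg, comment_avg)
```

So if the goal is the same two-bar chart as in Python, the grouped columns would still need to be averaged down to one number each before plotting.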

Upvotes: 2

Bogumił Kamiński

Reputation: 69949

You can do e.g.

using DataFramesMeta
using Statistics  # provides mean

agg_df = @chain df begin
    @rtransform(:threadtitlecharactercount = length(:thread_title))
    groupby(:thread_id)
    @combine(:threadtitlecharactercount_avg = mean(:threadtitlecharactercount))
    @transform(:thread_mean_score = mean(:threadtitlecharactercount_avg))
end

From the agg_df data frame you should be able to make the plot you want. I did this for threads; the code for comments would be similar.

Upvotes: 2
