Reputation: 63
I have a .csv file inside a .gzip archive that I want to extract some data from. Specifically, I want to compute the average character length of the thread titles and of the comments in the .csv file (it is a huge Reddit forum) and plot them in a chart.
This is the function in Python:
def average_data_threads_and_comments():
    # Create columns for the DF and add them
    df['thread_title_character_count'] = df['thread_title'].str.len()
    df['comment_body_character_count'] = df['comment_body'].str.len()
    # Calculation
    df_thread_mean_character_count = df.groupby('thread_id')['thread_title_character_count'].mean().mean()
    df_thread_mean_score = df.groupby('thread_id')['thread_score'].mean().mean()
    df_comment_mean_character_count = df.groupby('comment_id')['comment_body_character_count'].mean().mean()
    df_comment_mean_score = df.groupby('comment_id')['comment_score'].mean().mean()
    # Create DataFrame for the output
    df_average_characters = pd.DataFrame({'thread_title_character_count': [df_thread_mean_character_count],'comment_body_character_count': [df_comment_mean_character_count]})
    # Create plot and save
    df_average_characters.plot(kind='bar')
    plt.xlabel('Typ')
    plt.ylabel('Average Characters: ')
    plt.savefig(output_folder+'5_average_characters_python.png', bbox_inches="tight")
    plt.close()
    # Create DataFrame for the output
    df_average_score = pd.DataFrame({'thread_score': [df_thread_mean_score],'comment_score': [df_comment_mean_score]})
    # Create plot and save
    df_average_score.plot(kind='bar')
    plt.xlabel('Typ')
    plt.ylabel('Average Score')
    plt.savefig(output_folder+'6_average_score_python.png', bbox_inches="tight")
    plt.close()
This is what I have so far in Julia:
function average_data_threads_and_comments()
    df.threadtitlecharactercount = length(df.thread_title)
    df.commentcharactercount = length(df.comment_body.length)
    thread_mean_character_count = mean(combine(groupby(df, :threadtitlecharactercount),
        :thread_id => length ∘ unique => :thread_mean_count))
    thread_mean_score_count = mean(combine(groupby(df, :thread_score),
        :thread_id => length ∘ unique => :thread_mean_score))
    comment_mean_character_count = mean(combine(groupby(df, :commentcharactercount),
        :comment_id => length ∘ unique => :comment_mean_count))
    comment_mean_score_count = mean(combine(groupby(df, :comment_score),
        :comment_id => length ∘ unique => :comment_mean_score))
    bar3 = bar(thread_mean_character_count.threadtitlecharactercount,
        comment_mean_character_count.commentcharactercount)
    xlabel!("Typ")
    ylabel!("Durchschnittliche Zeichenlänge")
    title!("Durchschnittliche Anzahl der Zeichen pro Thread und Kommentar")
    png(bar3,output_folder*"5_durchschnittliche_anzahl_zeichen_julia.png")
    bar4 = bar(thread_mean_score_count.thread_score, comment_mean_score_count.comment_score)
    xlabel!("Typ")
    ylabel!("Durchschnittlicher Score")
    title!("Durchschnittliche Anzahl der Punkte pro Thread und Kommentar")
    png(bar4,output_folder*"6_durchschnittliche_anzahl_punkte_julia.png")
end
Can someone help?
EDIT:
Sadly, I still have issues with my code. This is what I have now, after a lot of help:
function average_data_threads_and_comments()
    df.thread_title_character_count = passmissing(length).(df.thread_title)
    df.comment_character_count = passmissing(length).(df.comment_body)
    df_thread_character_count = combine(groupby(df, :thread_id),
        [:thread_title_character_count] .=> first ∘ skipmissing .=> [:df_thread_mean_character_count])
    df_thread_score = combine(groupby(df, :thread_score),
        [:thread_score] .=> first ∘ skipmissing .=> [:df_thread_score_mean])
    df_comment_character_count = combine(groupby(df, :comment_id),
        [:comment_character_count] .=> mean ∘ skipmissing .=> [:df_comment_character_count_mean])
    df_comment_score = combine(groupby(df, :comment_score),
        [:comment_score] .=> mean ∘ skipmissing .=> [:df_comment_score_mean])
    bar3 = bar(df_thread_character_count.df_thread_mean_character_count, df_comment_character_count.df_comment_character_count_mean)
    xlabel!("Typ")
    ylabel!("Average Character length")
    title!("Average Number of Characters per Thread and Comment")
    png(bar3,output_folder*"5_average_character_length_julia.png")
    bar4 = bar(df_thread_score.df_thread_score_mean, df_comment_score.df_comment_score_mean)
    xlabel!("Typ")
    ylabel!("Average Score")
    title!("Average Score for Comments and Threads")
    png(bar4,output_folder*"6_average_score_julia.png")
end
I get this error message in my REPL: "bar recipe: x must be same length as y (centers), or one more than y (edges)." However, seemingly the same code runs fine in Python. Can someone spot the mistake?
Upvotes: 2
Views: 937
Reputation: 14705
df.threadtitlecharactercount = length(df.thread_title)
You want the length of each string in the thread_title column, but length(df.thread_title) takes the length of the column itself, i.e. how many rows of data it has. To apply length to each element of the column, use length.(df.thread_title) instead. If there's missing data in the column, you'll need passmissing(length).(df.thread_title) instead.
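
Since the question starts from Python, the same distinction can be sketched in plain Python. This is only an illustration on made-up titles: None stands in for Julia's missing, and the helper name length_or_none is hypothetical.

```python
# Made-up column of thread titles; None stands in for a missing value.
thread_title = ["First post", None, "Hi"]

# Length of the column itself: the number of rows,
# which is what length(df.thread_title) returns in Julia.
row_count = len(thread_title)


def length_or_none(value):
    """Return the string length, passing missing values through unchanged."""
    return None if value is None else len(value)


# Length of each element, roughly what passmissing(length).(df.thread_title) does.
title_lengths = [length_or_none(title) for title in thread_title]

print(row_count)      # 3
print(title_lengths)  # [10, None, 2]
```

The first result is a single number for the whole column, while the second has one entry per row, which is what the per-thread averages need.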
length(df.comment_body.length)
This should probably also be just passmissing(length).(df.comment_body).
thread_mean_character_count = mean(combine(groupby(df, :threadtitlecharactercount), :thread_id => length ∘ unique => :thread_mean_count))
It seems you want to group by thread_id first. Just like you do df.groupby('thread_id') in Python, it should be groupby(df, :thread_id) here.
However, at this point I'm a bit confused by the logic of the code, whether in Python or Julia. Logically, one would assume that each thread_id has a unique thread_title, so if a given thread id appears ten times, the corresponding thread_title_character_count would be the same each of those ten times. Can you confirm whether that's correct: given a fixed value in thread_id, there should be a fixed value in thread_title too? Can you verify that this is the case right after the dataframe is loaded from the CSV?
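
One way to run that check, sketched in plain Python on a few made-up rows. The row values and the names titles_per_thread and inconsistent are assumptions for illustration, not taken from the actual CSV.

```python
from collections import defaultdict

# Made-up (thread_id, thread_title) pairs standing in for rows of the CSV.
rows = [
    ("t1", "Julia vs Python"),
    ("t1", "Julia vs Python"),
    ("t2", "Plotting help"),
]

# Collect the distinct titles seen for each thread id.
titles_per_thread = defaultdict(set)
for thread_id, thread_title in rows:
    titles_per_thread[thread_id].add(thread_title)

# Any thread id that maps to more than one title breaks the assumption.
inconsistent = {
    tid: titles for tid, titles in titles_per_thread.items() if len(titles) > 1
}

print(inconsistent)  # an empty dict means one title per thread_id
```

If the result is empty on the real data, taking the first non-missing value per thread is enough and no averaging within a thread is needed.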
If that is so, here you want:
using Statistics: mean
counts_and_scores = combine(groupby(df, :thread_id),
    [:thread_title_character_count, :thread_score] .=> first ∘ skipmissing .=> identity,
    [:comment_character_count, :comment_score] .=> mean ∘ skipmissing .=> [:comment_character_count_mean, :comment_score_mean])
After this, you can plot the things you want, like
bar(counts_and_scores.thread_title_character_count, counts_and_scores.comment_character_count_mean)
and
bar(counts_and_scores.thread_score, counts_and_scores.comment_score_mean)
Upvotes: 2
Reputation: 69949
You can do e.g.
using DataFramesMeta
using Statistics  # provides mean
agg_df = @chain df begin
    @rtransform(:threadtitlecharactercount = length(:thread_title))
    groupby(:thread_id)
    @combine(:threadtitlecharactercount_avg = mean(:threadtitlecharactercount))
    @transform(:thread_mean_score = mean(:threadtitlecharactercount_avg))
end
From the agg_df data frame you should be able to do the plot you want.
I did this for threads; the code for comments would be similar.
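
For comparison, the group-then-average logic of this chain (and of the .mean().mean() calls in the Python version) can be written out in plain Python. This is a rough sketch on made-up rows, not the real data, and all variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up (thread_id, title_length) pairs; a thread id repeats once per comment row.
rows = [("t1", 10), ("t1", 10), ("t2", 20), ("t3", 30)]

# Group the title lengths by thread id.
lengths_by_thread = defaultdict(list)
for thread_id, title_length in rows:
    lengths_by_thread[thread_id].append(title_length)

# First average within each thread, then average those per-thread values.
per_thread_avg = {tid: mean(vals) for tid, vals in lengths_by_thread.items()}
overall_avg = mean(per_thread_avg.values())

# A plain mean over all rows counts t1 twice and gives a different number.
row_avg = mean(title_length for _, title_length in rows)

print(overall_avg)  # 20
print(row_avg)      # 17.5
```

The gap between the two results is why the grouping step matters: without it, threads with many comments weigh more heavily in the average.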
Upvotes: 2