Reputation: 287
I have a list of 100 dataframes that I am trying to merge into a single dataframe, but I am unable to do so. All the dataframes have different columns and are of different lengths. For context, each dataframe consists of 4 sentiment scores (calculated using VaderSentiment). The dataframes have the following representation:
USER 1 DATAFRAME
created_at | positive score of user 1 tweets | negative score of user 1 tweets| neutral score of user 1 tweets | compound score of user 1 tweets |
23/2/2011 10:00 | 1.12 | 1.3 | 1.0 | 3.3 |
24/2/2011 11:00 | 1.20 | 1.1 | 0.9 | 2.5 |
USER 2 DATAFRAME
created_at | positive score of user 2 tweets | negative score of user 2 tweets | neutral score of user 2 tweets | compound score of user 2 tweets |
25/3/2011 23:00 | 0.12 | 1.1 | 0.1 | 1.1 |
26/3/2011 08:00 | 1.40 | 1.5 | 0.4 | 1.5 |
01/4/2011 19:00 | 1.80 | 0.1 | 1.9 | 3.9 |
All the dataframes have one column in common, namely created_at. What I am trying to achieve is to merge all the dataframes on the created_at column, so that I get a single created_at column plus every score column from every dataframe. The result should have 400 columns of sentiment scores along with one created_at column.
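To illustrate the kind of outer merge I am after, here is a toy version with two small frames (shortened column names and made-up numbers, not my real data):

```python
import pandas as pd
from functools import reduce

# Two toy per-user frames sharing only the created_at column.
user1 = pd.DataFrame({
    "created_at": ["23/2/2011 10:00", "24/2/2011 11:00"],
    "pos_user1": [1.12, 1.20],
    "neg_user1": [1.3, 1.1],
})
user2 = pd.DataFrame({
    "created_at": ["23/2/2011 10:00", "25/3/2011 23:00"],
    "pos_user2": [0.12, 1.40],
    "neg_user2": [1.1, 1.5],
})

# Outer-merge every frame on created_at: each timestamp appears once,
# and every user's score columns come along (NaN where a user has no row).
merged = reduce(
    lambda left, right: pd.merge(left, right, on="created_at", how="outer"),
    [user1, user2],
)
print(merged)
```

With my 100 real files this would give one created_at column and 400 score columns.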
My code is as follows:
import pandas as pd
import glob
import numpy as np
import os
from functools import reduce

path = r'C:\Users\Desktop\Tweets'
allFiles = glob.glob(path + "/*.csv")
list = []
frame = pd.DataFrame()
count = 0
for f in allFiles:
    file = open(f, 'r')
    count = count + 1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)
    #print(df)
    list.append(df)
frame = pd.concat(list)
print(frame)
The problem is that when I run the code above, I get the desired arrangement of columns, but all the values are NaN. I essentially end up with a dataframe of 401 columns in which only the created_at column contains values.
Any and all help is appreciated.
Thank you
EDIT
I have tried the solutions to various similar questions posted here, but none of them seem to work, so as a last resort I have started this thread.
EDIT 2
I have perhaps come up with a solution to my problem. Using the code below, I can append all the columns into frames. However, this creates duplicate created_at columns, which happen to be of type object. If I could merge all the dates into one column, my troubles would be much closer to being solved.
all_frames = []
for f in allFiles:
    file = open(f, 'r')
    count = count + 1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)
    dates = df.iloc[:, 0]
    neut = df.iloc[:, 1]
    pos = df.iloc[:, 2]
    neg = df.iloc[:, 3]
    comp = df.iloc[:, 4]
    all_frames.append(dates)
    all_frames.append(neut)
    all_frames.append(pos)
    all_frames.append(neg)
    all_frames.append(comp)
frame = pd.concat(all_frames, axis=1)
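The rough idea I have in mind for collapsing the dates (a sketch on made-up in-memory frames, not my real files) is to make created_at the index of each frame before concatenating column-wise, so the dates share a single axis instead of being duplicated:

```python
import pandas as pd

# Two toy per-user frames (invented values).
user1 = pd.DataFrame({
    "created_at": pd.to_datetime(["2011-02-23 10:00", "2011-02-24 11:00"]),
    "pos_user1": [1.12, 1.20],
})
user2 = pd.DataFrame({
    "created_at": pd.to_datetime(["2011-02-23 10:00", "2011-03-25 23:00"]),
    "pos_user2": [0.12, 1.40],
})

# Index each frame by created_at, then concatenate column-wise:
# the indexes are outer-joined, leaving one date axis and no duplicates.
frames = [df.set_index("created_at") for df in (user1, user2)]
frame = pd.concat(frames, axis=1)
print(frame)
```

I am not sure this is the idiomatic way, but it avoids the duplicate object-typed created_at columns.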
Any help would be appreciated
Upvotes: 2
Views: 873
Reputation: 93151
I strongly suggest you revisit your data model: having that many columns usually signals that something is wrong. Having said that, here's one way to do it. Also, list is a built-in type; don't shadow it with a variable name.

I assume that, other than created_at, the columns in each file are unique.
all_frames = []
for f in allFiles:
    df = pd.read_csv(f, parse_dates=['created_at'], index_col='created_at')
    all_frames.append(df)

# This will create a dataframe of size n * 400,
# where n is the total number of rows across all files
frame = pd.concat(all_frames, join='outer', copy=False, sort=False)

# If you want to line up the hour across all users
frame = frame.groupby(level=0).first()
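A self-contained illustration of the approach on two small in-memory frames (names and numbers invented, standing in for your CSV files):

```python
import pandas as pd

# Each "file" becomes a frame indexed by created_at, as read_csv with
# index_col='created_at' would produce.
user1 = pd.DataFrame(
    {"pos_user1": [1.12, 1.20]},
    index=pd.to_datetime(["2011-02-23 10:00", "2011-02-24 11:00"]),
)
user2 = pd.DataFrame(
    {"pos_user2": [0.12, 1.40]},
    index=pd.to_datetime(["2011-02-23 10:00", "2011-03-25 23:00"]),
)
user1.index.name = user2.index.name = "created_at"

# Row-wise outer concat: union of all columns, NaN where a user has no score.
frame = pd.concat([user1, user2], join="outer", sort=False)

# Collapse rows that share a timestamp, keeping the first non-NaN value
# per column, so each created_at appears exactly once.
frame = frame.groupby(level=0).first()
print(frame)
```

With 100 files this leaves one row per distinct timestamp and 400 score columns.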
Upvotes: 1