pandas creating new table from two tables

Question

I have to join two tables and create a table with dates, but my code is way to long and I believe that I done it the super long way.Apparently the soulution to this only had 22 lines. Is there another way and more shorter way to approach this problem. Here is the question

HERE IS MY CODE, and again I believe it is to long and I think there is a shorter way to do this.

import numpy as np
import pandas as pd
import datetime

#YOUR CODE GOES HERE#

def get_month(i):
    """this function returns the number of the month based on stringinput"""
    if i == "January":
        return 1
    elif i == "February":
        return 2
    elif i == "March":
        return 3
    elif i == "April":
        return 4
    elif i == "May":
        return 5
    elif i == "June":
        return 6
    elif i == "July":
        return 7
    elif i == "August":
        return 8
    elif i == "September":
        return 9
    elif i == "October":
        return 10
    elif i == "November":
        return 11
    elif i == "December":
        return 12

def get_reformatted_date(s):
    """this function reformats a datetime object to the output we're looking for"""
    return s.strftime("%d-%b-%y")


month_names = []
tab1 = pd.read_csv("data1.csv")
tab2 = pd.read_csv("data2.csv")
tab1_tweets = tab1['Tweet'].tolist()[::-1]
tab2_tweets = tab2['Tweet'].tolist()[::-1]
tab1_months = tab1['Month'].tolist()[::-1]
tab2_months = tab2['Month'].tolist()[::-1]
tab1_days = tab1['Day'].tolist()[::-1]
tab2_days = tab2['Day'].tolist()[::-1]
tab1_years = tab1['Year'].tolist()[::-1]
tab2_years = tab2['Year'].tolist()[::-1]
all_dates = []
all_tweets = []
tab1_count = 0
tab2_count = 0
for i in range(len(tab1_tweets) + len(tab2_tweets)):
    if(tab1_count < len(tab1_years) and tab2_count < len(tab2_years)):
        t1_date = datetime.date(tab1_years[tab1_count], tab1_months[tab1_count], tab1_days[tab1_count])
        t2_date = datetime.date(tab2_years[tab2_count], get_month(tab2_months[tab2_count]), tab2_days[tab2_count])
        if t1_date > t2_date:
            all_dates.append(t1_date)
            all_tweets.append(tab1_tweets[tab1_count])
            tab1_count += 1
        else:
            all_dates.append(t2_date)
            all_tweets.append(tab2_tweets[tab2_count])
            tab2_count += 1
    elif(tab2_count < len(tab2_years)):
        t2_date = datetime.date(tab2_years[tab2_count], get_month(tab2_months[tab2_count]), tab2_days[tab2_count])
        all_dates.append(t2_date)
        all_tweets.append(tab2_tweets[tab2_count])
        tab2_count += 1
    else:
        t1_date = datetime.date(tab1_years[tab1_count], tab1_months[tab1_count], tab1_days[tab1_count])
        all_dates.append(t1_date)
        all_tweets.append(tab1_tweets[tab1_count])
        tab1_count += 1

table_data = {'Date': all_dates, 'Tweet': all_tweets}
df = pd.DataFrame(table_data)
df['Date'] = df['Date'].apply(get_reformatted_date)
print(df)

data1.csv is

Tweet                 Month Day  Year
Hello World             6    2    2013
I want ice-cream!       7    23   2013
Friends will be friends 9    30   2017
Done with school        12   12   2017

the data2.csv is

Month   Day Year    Hour    Tweet
January 2   2015    12  Happy New Year
March   21  2016    7   Today is my final
May     30  2017    23  Summer is about to begin
July    15  2018    11  Ocean is still cold

sacuL · Accepted Answer

I think that you can theoretically do this whole thing in one line:

finaldf = (pd.concat([pd.read_csv('data1.csv',
                            parse_dates={'Date':['Year', 'Month', 'Day']}),
                      pd.read_csv('data2.csv',
                            parse_dates={'Date':['Year', 'Month', 'Day']})
                      [['Date', 'Tweet']]])
            .sort_values('Date', ascending=False))

But for the sake of readability, its better to split it into a few lines:

df1 = pd.read_csv('data1.csv', parse_dates={'Date':['Year', 'Month','Day']})
df2 = pd.read_csv('data2.csv', parse_dates={'Date':['Year', 'Month','Day']})

finaldf = (pd.concat([df1, df2[['Date', 'Tweet']]])
          .sort_values('Date', ascending=False))

I think that for what you're trying to do, the main things to read up about are the parse_dates argument of pandas read_csv, and pd.concat to concatenate dataframes

Edit: in order to get the dates in the correct format as you have in your example output, you can call this after the code above, using Series.dt.strftime():

finaldf['Date'] = finaldf['Date'].dt.strftime('%d-%b-%y')

pandas creating new table from two tables

Answers (1)

Related Questions