Baobab1988

Reputation: 715

How to separate monthly and daily filenames in a Python list into two separate lists?

I have a list:

allFiles =['https://myurl.com/something//something_01-01-2020.csv',
           'https://myurl.com/something//something_01-02-2020.csv', 
           'https://myurl.com/something//something_03-2020.csv'
           'https://myurl.com/something//something_01-03-2020.csv',
           'https://myurl.com/something//something_04-2020.csv'...]

How can I separate monthly and daily files into two separate lists?

Desired output:

   daily = ['https://myurl.com/something//something_01-01-2020.csv',
           'https://myurl.com/something//something_01-02-2020.csv', 
           'https://myurl.com/something//something_01-03-2020.csv']

   monthly = ['https://myurl.com/something//something_03-2020.csv',
              'https://myurl.com/something//something_04-2020.csv']

I tried the following, without success:

 daily = [ x for x in allFiles if "%m-%Y.csv" not in x ]

Could someone please help? Thank you in advance!

Upvotes: 0

Views: 48

Answers (8)

Roman Pavelka

Reputation: 4181

  1. Note that your example contains a mistake: a comma is missing.

  2. Your issue is well-suited for regular expressions.

import re

allFiles =['https://myurl.com/something//something_01-01-2020.csv',
           'https://myurl.com/something//something_01-02-2020.csv', 
           'https://myurl.com/something//something_03-2020.csv',
           'https://myurl.com/something//something_01-03-2020.csv',
           'https://myurl.com/something//something_04-2020.csv']

dailyRegexp = re.compile(r".*\d\d-\d\d-\d\d\d\d\.csv$")
isDaily = lambda fn: dailyRegexp.match(fn)

daily = [fn for fn in allFiles if isDaily(fn)]
monthly = [fn for fn in allFiles if not isDaily(fn)]

print("Daily:", daily)
print("Monthly:", monthly)

Explanation of the regexp:

  • .* is any character (.), repeated any number of times (*)
  • \d is any digit
  • - is just literal character -, no special meaning
  • \. is a dot character (escaped by backslash to prevent special meaning)
  • csv is literal string, no special meaning
  • $ is end of string

Note also the r before the string. It marks a raw string, which prevents Python from interpreting \ as an escape character.
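
To see these pieces in action, here is a small sketch. The two sample names come from the question; the variable names are only illustrative. It runs the same pattern against one daily and one monthly name, and compares a raw string with a regular one.

```python
import re

# Same pattern as above: dd-dd-dddd followed by ".csv" at the end of the string.
daily_regexp = re.compile(r".*\d\d-\d\d-\d\d\d\d\.csv$")

daily_name = "https://myurl.com/something//something_01-01-2020.csv"
monthly_name = "https://myurl.com/something//something_03-2020.csv"

# match() returns a match object (truthy) or None (falsy).
daily_matches = bool(daily_regexp.match(daily_name))
monthly_matches = bool(daily_regexp.match(monthly_name))

# In a raw string the backslash stays a literal character.
raw_len = len(r"\n")    # backslash followed by "n"
plain_len = len("\n")   # a single newline character

print(daily_matches, monthly_matches, raw_len, plain_len)
```

Without the r prefix, Python would process the backslashes before the regex engine ever sees them, which is why raw strings are the usual choice for patterns.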

Upvotes: 0

Adithya

Reputation: 1728

Maybe like this:

month_files = [f for f in allFiles if len(f.rpartition('_')[2].split('-'))==2]

day_files = [f for f in allFiles if len(f.rpartition('_')[2].split('-'))==3]

rpartition splits the filename on the last _ and returns a 3-tuple like ('somename', '_', 'the date/month .csv'). You can then filter on the date part by splitting it on - and checking the length.

Because rpartition splits on the last _, it works even if the filename contains multiple underscores.
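
As a quick illustration, here is what rpartition returns for a URL with several underscores. The URLs below are hypothetical (the question's names contain only one underscore) and are there only to show the behaviour.

```python
# Hypothetical URLs with extra underscores, not taken from the question.
monthly_url = "https://myurl.com/some_thing//some_thing_03-2020.csv"
daily_url = "https://myurl.com/some_thing//some_thing_01-03-2020.csv"

# rpartition splits on the LAST "_" and returns (head, separator, tail).
head, sep, tail = monthly_url.rpartition("_")

monthly_parts = tail.split("-")                            # month, year + extension
daily_parts = daily_url.rpartition("_")[2].split("-")      # day, month, year + extension

print(head, sep, tail)
print(len(monthly_parts), len(daily_parts))
```

Only the text after the last underscore is inspected, so underscores earlier in the URL do not affect the count.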

Upvotes: 0

Mark Moretto

Reputation: 2348

Using regular expressions:

import re

daily_pattern = r"""
    ^       # Start of string
    .+?     # Match anything except newline (not greedy)
    \d{2}   # Two numerical values.
    -       # Hyphen
    \d{2}   # Two numerical values.
    -       # Hyphen
    \d{4}   # Four numerical values.
    \.\w+   # File extension with escaped period.
    $       # End of string
"""

# Compile with re.I (ignore case) and re.X (verbose mode: allows whitespace and comments in the pattern)
p = re.compile(daily_pattern, flags=re.I | re.X)

daily = [f for f in allFiles if p.match(f)]
monthly = [f for f in allFiles if f not in daily]

EDIT: Updated to include more explanation.
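
A possible variation, sketched here as an assumption rather than as part of the answer: the membership test against daily rescans that list for every file, so for long lists a single loop that fills both lists may be preferable. The function name split_daily_monthly is hypothetical, and the pattern is the same one written on a single line.

```python
import re

# Same daily pattern as above, written without verbose mode.
daily_re = re.compile(r"^.+?\d{2}-\d{2}-\d{4}\.\w+$", flags=re.I)

def split_daily_monthly(files):
    """Return (daily, monthly), visiting each file exactly once."""
    daily, monthly = [], []
    for f in files:
        # Anything that is not a daily name is treated as monthly here.
        (daily if daily_re.match(f) else monthly).append(f)
    return daily, monthly

sample = [
    "https://myurl.com/something//something_01-01-2020.csv",
    "https://myurl.com/something//something_03-2020.csv",
    "https://myurl.com/something//something_01-03-2020.csv",
]
daily, monthly = split_daily_monthly(sample)
print(daily)
print(monthly)
```

Like the list comprehensions, this treats every non-daily name as monthly; it does not check that the remaining names really follow the monthly format.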

Upvotes: 0

hhaefliger

Reputation: 531

Assuming that there is no other "_" in the file name:

monthly = [file for file in allFiles if len(file.split('_')[1].split('-')) == 2]
daily = [file for file in allFiles if len(file.split('_')[1].split('-')) == 3]
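
To make the stated assumption concrete, this sketch shows what happens when it does not hold. The URL is hypothetical; rsplit is shown only as one possible workaround.

```python
# Hypothetical URL with more than one underscore.
url = "https://myurl.com/some_thing//some_thing_03-2020.csv"

# split('_')[1] picks the second underscore-separated piece, not the date.
second_piece = url.split("_")[1]
count_with_split = len(second_piece.split("-"))

# rsplit('_', 1)[-1] keeps only what follows the last underscore.
date_piece = url.rsplit("_", 1)[-1]
count_with_rsplit = len(date_piece.split("-"))

print(second_piece, count_with_split)
print(date_piece, count_with_rsplit)
```

With a count of 1, the file would match neither comprehension and would be silently left out of both lists.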

Upvotes: 0

Rakesh

Reputation: 82785

You can use Regex here

Ex:

import re

allFiles =['https://myurl.com/something//something_01-01-2020.csv',
           'https://myurl.com/something//something_01-02-2020.csv', 
           'https://myurl.com/something//something_03-2020.csv',
           'https://myurl.com/something//something_01-03-2020.csv',
           'https://myurl.com/something//something_04-2020.csv']

daily = []
monthly = []

for i in allFiles:
    if re.search(r"_(\d+\-\d+\.csv)$", i):
        monthly.append(i)
    else:
        daily.append(i)

print(daily)
print(monthly)

Output:

['https://myurl.com/something//something_01-01-2020.csv', 'https://myurl.com/something//something_01-02-2020.csv', 'https://myurl.com/something//something_01-03-2020.csv']

['https://myurl.com/something//something_03-2020.csv', 'https://myurl.com/something//something_04-2020.csv']

Upvotes: 0

First, create a function that classifies each URL as daily or monthly by counting the dash-separated parts of its date:

import pandas as pd

allFiles = ['https://myurl.com/something//something_01-01-2020.csv',
            'https://myurl.com/something//something_01-02-2020.csv',
            'https://myurl.com/something//something_03-2020.csv',
            'https://myurl.com/something//something_01-03-2020.csv',
            'https://myurl.com/something//something_04-2020.csv']

def month_or_day(string):
    return len(string.split('_')[1].split(".")[0].split('-'))

Then create a DataFrame and apply this function to each URL:

df = pd.DataFrame(allFiles, columns=['URL'])
# Number of dash-separated parts in the date: 2 = monthly, 3 = daily
df['intermediate'] = df['URL'].apply(month_or_day)

You can then separate the URLs as follows:

print('Month : ',df[df['intermediate']==2]['URL'].tolist())
print('')
print('Day : ',df[df['intermediate']==3]['URL'].tolist())

Upvotes: 0

pythomatic

Reputation: 657

You can split the url to get only the part you want, then count the hyphens to see the format of the date:

monthly = []
daily = []
for url in allFiles:
  # split the URL on '/' and keep only the part after the last '/'
  filename = url.rsplit('/', 1)[-1]
  # likewise split on '_' and keep only the date part, e.g. 01-01-2020.csv
  datestring = filename.rsplit('_', 1)[-1]
  datestring_hyphens = datestring.count('-')

  # append the full URL so the lists match the desired output
  if datestring_hyphens == 1:
    monthly.append(url)
  elif datestring_hyphens == 2:
    daily.append(url)

Upvotes: 0

sushanth

Reputation: 8302

Here is a solution that uses regex to identify the daily and monthly date patterns:

import re

# The dot is escaped so that it matches a literal "." rather than any character
daily_pattern = re.compile(r"\d{2}-\d{2}-\d{4}\.csv")
monthly_pattern = re.compile(r"\d{2}-\d{4}\.csv")

monthly, daily = [], []

for f in allFiles:
    if daily_pattern.search(f):
        daily.append(f)
    elif monthly_pattern.search(f):
        monthly.append(f)
    else:
        print('invalid pattern %s' % f)
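
One detail worth spelling out: the order of the two checks matters, because search() scans the whole string and the monthly pattern also occurs inside a daily filename. The sketch below demonstrates this with a name from the question. The stricter strict_monthly pattern is an assumption about the naming scheme (date preceded by an underscore, ".csv" at the very end), not part of the answer.

```python
import re

daily_pattern = re.compile(r"\d{2}-\d{2}-\d{4}\.csv")
monthly_pattern = re.compile(r"\d{2}-\d{4}\.csv")

daily_file = "https://myurl.com/something//something_01-01-2020.csv"
monthly_file = "https://myurl.com/something//something_03-2020.csv"

# The monthly pattern finds the tail of the daily date, so daily must be tested first.
overlap = monthly_pattern.search(daily_file)
overlap_text = overlap.group(0)

# Assumed stricter variant: anchored on the preceding "_" and the end of the string.
strict_monthly = re.compile(r"_\d{2}-\d{4}\.csv$")
strict_on_daily = strict_monthly.search(daily_file)
strict_on_monthly = strict_monthly.search(monthly_file)

print(overlap_text, strict_on_daily, bool(strict_on_monthly))
```

With the anchored variant the two patterns no longer overlap, so the result would not depend on which branch is tested first.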

Upvotes: 1
