Reputation: 715
I have a list:
allFiles =['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_03-2020.csv'
'https://myurl.com/something//something_01-03-2020.csv',
'https://myurl.com/something//something_04-2020.csv'...]
How can I separate monthly and daily files into two separate lists?
Desired output:
daily = ['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_01-03-2020.csv']
monthly = ['https://myurl.com/something//something_03-2020.csv',
'https://myurl.com/something//something_04-2020.csv']
I tried the following, but without success:
daily = [ x for x in allFiles if "%m-%Y.csv" not in x ]
Could someone please help? Thank you in advance!
Upvotes: 0
Views: 48
Reputation: 4181
Note that your example contains a mistake: a comma is missing after the third item.
Your issue is well-suited for regular expressions.
import re
allFiles =['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_03-2020.csv',
'https://myurl.com/something//something_01-03-2020.csv',
'https://myurl.com/something//something_04-2020.csv']
dailyRegexp = re.compile(r".*\d\d-\d\d-\d\d\d\d\.csv$")
isDaily = lambda fn: dailyRegexp.match(fn)
daily = [fn for fn in allFiles if isDaily(fn)]
monthly = [fn for fn in allFiles if not isDaily(fn)]
print("Daily:", daily)
print("Monthly:", monthly)
Explanation of the regexp:
.* is any character (.), repeated any number of times (*)
\d is any digit
- is just the literal character -, with no special meaning
\. is a dot character (escaped by a backslash to prevent its special meaning)
csv is a literal string, with no special meaning
$ is the end of the string
Notice also the r before the string. It marks a raw string, which prevents Python from interpreting \ as a special character.
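To see how the pattern behaves, here is a minimal sketch that runs it against two shortened file names. The sample names and the variable names daily_regexp, samples and results are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Same pattern as in the answer above
daily_regexp = re.compile(r".*\d\d-\d\d-\d\d\d\d\.csv$")

# Hypothetical sample names, shortened from the question's URLs
samples = ["something_01-01-2020.csv", "something_03-2020.csv"]

# match() returns a match object or None, so bool() gives True/False
results = {name: bool(daily_regexp.match(name)) for name in samples}
print(results)

# A raw string keeps the backslash as-is: r"\d" is the two characters \ and d
raw_is_two_chars = len(r"\d") == 2
print(raw_is_two_chars)
```

Only the name with the day-month-year shape should come back as True, because the monthly name has no second hyphen for the pattern to consume.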
Upvotes: 0
Reputation: 1728
Maybe like this:
month_files = [f for f in allFiles if len(f.rpartition('_')[2].split('-'))==2]
day_files = [f for f in allFiles if len(f.rpartition('_')[2].split('-'))==3]
rpartition splits the string at the last _ and gives you 3 items like ('somename', '_', 'the date/month .csv').
You can filter on the date part with split and a length check.
With rpartition it will work even if the filename has multiple _.
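As a rough illustration of that last point, the sketch below uses a hypothetical URL with an extra underscore in its path (not one of the URLs from the question) and compares rpartition with a plain split.

```python
# Hypothetical URL with an extra underscore in the path (an assumption for this demo)
url = "https://myurl.com/some_thing//something_03-2020.csv"

# rpartition splits once, at the LAST underscore
head, sep, tail = url.rpartition("_")
print(tail)

# Counting the hyphen-separated parts of the tail: 2 means monthly, 3 means daily
part_count = len(tail.split("-"))
print(part_count)

# split('_')[1] takes the piece after the FIRST underscore instead
first_split_piece = url.split("_")[1]
print(first_split_piece)
```

Here tail is the date part, while the plain split picks up a fragment of the path, so the length check would be applied to the wrong text.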
Upvotes: 0
Reputation: 2348
Using regular expressions:
import re
daily_pattern = r"""
^ # Start of string
.+? # Match anything except newline (not greedy)
\d{2} # Two numerical values.
- # Hyphen
\d{2} # Two numerical values.
- # Hyphen
\d{4} # Four numerical values.
\.\w+ # File extension with escaped period.
$ # End of string
"""
# Compile with re.I (ignore case) and re.X (verbose mode, which allows whitespace and comments in the pattern)
p = re.compile(daily_pattern, flags=re.I | re.X)
daily = [f for f in allFiles if p.match(f)]
monthly = [f for f in allFiles if f not in daily]
EDIT: Updated to include more explanation.
Upvotes: 0
Reputation: 531
Assuming that there is no other "_" in the file name:
monthly = [file for file in allFiles if len(file.split('_')[1].split('-')) == 2]
daily = [file for file in allFiles if len(file.split('_')[1].split('-')) == 3]
Upvotes: 0
Reputation: 82785
You can use Regex here
Ex:
import re
allFiles =['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_03-2020.csv',
'https://myurl.com/something//something_01-03-2020.csv',
'https://myurl.com/something//something_04-2020.csv']
daily = []
monthly = []
for i in allFiles:
    if re.search(r"_(\d+\-\d+\.csv)$", i):
        monthly.append(i)
    else:
        daily.append(i)
print(daily)
print(monthly)
Output:
['https://myurl.com/something//something_01-01-2020.csv', 'https://myurl.com/something//something_01-02-2020.csv', 'https://myurl.com/something//something_01-03-2020.csv']
['https://myurl.com/something//something_03-2020.csv', 'https://myurl.com/something//something_04-2020.csv']
Upvotes: 0
Reputation: 398
First, create a function that classifies each URL as daily or monthly by counting the hyphen-separated parts of its date:
allFiles =['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_03-2020.csv',
'https://myurl.com/something//something_01-03-2020.csv',
'https://myurl.com/something//something_04-2020.csv']
def month_or_day(string):
    return len(string.split('_')[1].split(".")[0].split('-'))
Then create a DataFrame and apply this function to each URL:
import pandas as pd
df = pd.DataFrame(allFiles, columns=['URL'])
df['intermediate'] = df['URL'].apply(month_or_day)
You can then separate the URLs as follows:
print('Month : ',df[df['intermediate']==2]['URL'].tolist())
print('')
print('Day : ',df[df['intermediate']==3]['URL'].tolist())
Upvotes: 0
Reputation: 657
You can split the url to get only the part you want, then count the hyphens to see the format of the date:
monthly = []
daily = []
for url in allFiles:
    # split the URL on '/' and keep only the part after the last '/'
    filename = url.rsplit('/', 1)[-1]
    # same idea, but split on '_' to keep only the date part, e.g. 01-01-2020.csv
    datestring = filename.rsplit('_', 1)[-1]
    datestring_hyphens = datestring.count('-')
    if datestring_hyphens == 1:
        monthly.append(url)
    elif datestring_hyphens == 2:
        daily.append(url)
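Once the date part is isolated like this, a different technique from hyphen counting is to let datetime.strptime validate it. The sketch below is only one possible variant: the helper name classify is made up, and the day-first format "%d-%m-%Y" is an assumption, since the question does not say whether the daily names are day-month-year or month-day-year.

```python
from datetime import datetime

def classify(url):
    """Return 'daily', 'monthly', or None for an unrecognised name."""
    # Text after the last '_', with the extension after the last '.' dropped
    datestring = url.rsplit('_', 1)[-1].rsplit('.', 1)[0]
    # Assumed formats: day-month-year for daily files, month-year for monthly files
    for label, fmt in (("daily", "%d-%m-%Y"), ("monthly", "%m-%Y")):
        try:
            datetime.strptime(datestring, fmt)
            return label
        except ValueError:
            continue  # this format did not fit, try the next one
    return None

kinds = [classify(u) for u in (
    'https://myurl.com/something//something_01-01-2020.csv',
    'https://myurl.com/something//something_03-2020.csv',
)]
print(kinds)
```

Unlike a hyphen count, this also rejects names whose numbers are not a valid date, at the cost of committing to specific date formats.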
Upvotes: 0
Reputation: 8302
Here is a solution that uses regex to identify the daily and monthly date patterns:
import re
daily_pattern = re.compile(r"\d{2}-\d{2}-\d{4}\.csv")
monthly_pattern = re.compile(r"\d{2}-\d{4}\.csv")
monthly, daily = [], []
for f in allFiles:
    if daily_pattern.search(f):
        daily.append(f)
    elif monthly_pattern.search(f):
        monthly.append(f)
    else:
        print('invalid pattern %s' % f)
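One detail worth keeping in mind with this approach is that the daily check has to run first, because an unanchored month-year pattern also matches inside a daily name. The sketch below shows that overlap on a hypothetical shortened name; strict_monthly is an illustrative variant, not part of the answer.

```python
import re

# Month-year pattern in the style of the answer, with the dot escaped
monthly_pattern = re.compile(r"\d{2}-\d{4}\.csv")

daily_name = "something_01-01-2020.csv"    # hypothetical sample
monthly_name = "something_03-2020.csv"     # hypothetical sample

# search() finds "01-2020.csv" inside the daily name, so this is a match too
loose_hit = bool(monthly_pattern.search(daily_name))
print(loose_hit)

# Anchoring on the preceding '_' and the end of the string removes the overlap
strict_monthly = re.compile(r"_\d{2}-\d{4}\.csv$")
strict_hit_daily = bool(strict_monthly.search(daily_name))
strict_hit_monthly = bool(strict_monthly.search(monthly_name))
print(strict_hit_daily, strict_hit_monthly)
```

With the anchored variant the two checks no longer depend on their order, assuming the date always follows the last underscore as it does in the question's URLs.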
Upvotes: 1