Reputation: 1812
I'm trying to save files to a directory after scraping them from the web using scrapy. I'm extracting a date from the file and using that as the file name. The problem I'm running into, however, is that some files have the same date, i.e. there are two files that would take the name "June 2, 2009". So, what I'm looking to do is somehow check whether there is already a file with the same name, and if so, name it something like "June 2, 2009.1" or some such.
The code I'm using is as follows:
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
response = response.replace(body=response.body.replace('<br />', '\n'))
hxs = HtmlXPathSelector(response)
date = hxs.select("//div[@id='content']").extract()[0]
dateStrip = re.search(r"([A-Z]*|[A-z][a-z]+)\s\d*\d,\s[0-9]+", date)
newDate = dateStrip.group()
content = hxs.select("//div[@id='content']")
content = content.select('string()').extract()[0]
filename = ("/path/to/a/folder/ %s.txt") % (newDate)
with codecs.open(filename, 'w', encoding='utf-8') as output:
output.write(content)
Upvotes: 1
Views: 592
Reputation: 1812
The other answer pointed me in the correct direction by checking into the os tools in python, but I think the way I found is perhaps more straightforward. Reference here How do I check whether a file exists using Python? for more.
The following is the code I came up with:
existence = os.path.isfile(filename)
if existence == False:
with codecs.open(filename, 'w', encoding='utf-8') as output:
output.write(content)
else:
newFilename = ("/path/.../.../- " + '%s' ".1.txt") % (newDate)
with codecs.open(newFilename, 'w', encoding='utf-8') as output:
output.write(content)
Edited to Add:
I didn't like this solution too much, and thought the other answer's solution was probably better but didn't quite work. The main part I didn't like about my solution was that it only worked with 2 files of the same name; if three or four files had the same name the initial problem would occur. The following is what I came up with:
filename = ("/Users/path/" + " " + "title " + '%s' + " " + "-1.txt") % (date)
filename = str(filename)
while True:
os.path.isfile(filename)
newName = filename.replace(".txt", "", filename)
newName = str.split(newName)
newName[-1] = str(int(newName[-1]) + 1)
filename = " ".join(newName) + ".txt"
if os.path.isfile(filename) == False:
with codecs.open(filename, 'w', encoding='utf-8') as output:
output.write(texts)
break
It probably isn't the most elegant and might be kind of a hackish approach, but it has worked so far and seems to have addressed my problem.
Upvotes: 0
Reputation: 749
You can use os.listdir to get a list of existing files and allocate a filename that will not cause conflict.
import os
def get_file_store_name(path, fname):
count = 0
for f in os.listdir(path):
if fname in f:
count += 1
return os.path.join(path, fname+str(count))
# This is example to use
print get_file_store_name(".", "README")+".txt"
Upvotes: 1
Reputation: 4085
one other solution is you can append time with date, for naming file like
from datetime import datetime
filename = ("/path/to/a/folder/ %s_%s.txt") % (newDate,datetime.now().strftime("%H%M%S"))
Upvotes: 0
Reputation: 76725
The usual way to check for existence of a file in the C library is with a function called stat()
. Python offers a thin wrapper around this function in the form of os.stat()
. I suggest you use that.
http://docs.python.org/library/stat.html
def file_exists(fname):
try:
stat_info = os.stat(fname)
if os.S_ISREG(stat_info): # true for regular file
return True
except Exception:
pass
return False
Upvotes: 0