Indie

Reputation: 73

Python - Splitting a large string by number of delimiter occurrences

I'm still learning Python, and I have a problem I haven't been able to solve. I have a very long string (millions of lines long) that I would like to cut down to a shorter string based on a specified number of occurrences of a delimiter.

For instance:

ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//

In this case I would want to split on "//" and return a string of all lines before the nth occurrence of the delimiter.

So splitting the string on // with n = 1 would return:

ABCDEF

Splitting the string on // with n = 2 would return:

ABCDEF
//
GHIJKLMN

Splitting the string on // with n = 3 would return:

ABCDEF
//
GHIJKLMN
//
OPQ

And so on... However, the length of the original 2-million-line string turned out to be a problem when I simply tried to split the entire string on "//" and work with the individual pieces: I got a memory error. Perhaps Python can't handle so many lines in one split? So I can't do that.

I'm looking for a way to avoid splitting the entire string into a hundred thousand pieces when I may only need 100: just start from the beginning, stop at a certain point, and return everything before it, which I assume would also be faster. I hope my question is clear.

Is there a simple or elegant way to achieve this? Thanks!

Upvotes: 7

Views: 1546

Answers (5)

Brent Washburne

Reputation: 13158

If you want to work with files instead of strings in memory, here is another answer.

This version is written as a function that reads lines and immediately prints them out until the specified number of delimiters have been found (no extra memory needed to store the entire string).

def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()    # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)

file_split('data.txt', '//', 3)

You can use this to write the output to a new file like this:

python split.py > newfile.txt

With a little extra work, you can use argparse to pass parameters to the program.
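Here is a minimal sketch of the argparse idea, assuming the script is saved as split.py. The argument names (file_name, delimiter, -n/--count) and the main() wrapper are my own illustrative choices, not part of the code above:

```python
import argparse


def file_split(file_name, delimiter, n=1):
    """Print lines from file_name until the nth delimiter line is reached."""
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip("\n")   # strip only the newline
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)


def main(argv=None):
    # The argument names below are assumptions made for this sketch.
    parser = argparse.ArgumentParser(
        description="Print everything before the nth delimiter line.")
    parser.add_argument("file_name", help="path of the file to read")
    parser.add_argument("delimiter", nargs="?", default="//",
                        help="line that separates records (default: //)")
    parser.add_argument("-n", "--count", type=int, default=1,
                        help="stop at this occurrence of the delimiter")
    args = parser.parse_args(argv)   # argv=None means "use sys.argv[1:]"
    file_split(args.file_name, args.delimiter, args.count)
```

Call main() at the bottom of the script (behind the usual `if __name__ == "__main__":` guard) and you can run something like `python split.py data.txt // -n 3 > newfile.txt`.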

Upvotes: 1

Shalom Ray

Reputation: 25

Since you are learning Python, it would be a challenge to model a complete dynamic solution. Here's an idea of how you can approach one.

Note: The following code snippet only works for files in the given format (see the 'For instance' in the question), where every data line is followed by exactly one delimiter line. Hence, it is a static solution.

num = int(input("Enter number of delimiters: ")) * 2
with open("./data.txt") as myfile:
    # n data lines plus the n-1 delimiter lines between them = 2n-1 lines
    print("".join([next(myfile) for x in range(num - 1)]), end="")

Now that you have the idea, you can use pattern matching and so on.
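As a rough sketch of the pattern-matching route, assuming the data is already in memory as one string: the helper name before_nth is made up for illustration. It uses re.finditer, which yields matches lazily, so the string is only scanned up to the nth delimiter, and the pattern is anchored so that only a delimiter on a line of its own counts.

```python
import re
from itertools import islice


def before_nth(text, delimiter, n):
    """Return the part of text before the nth line consisting of delimiter."""
    # ^ and $ match at line boundaries because of re.MULTILINE.
    pattern = re.compile("^" + re.escape(delimiter) + "$", re.MULTILINE)
    matches = pattern.finditer(text)
    # Skip the first n-1 matches and take the next one, if there is one.
    nth = next(islice(matches, n - 1, None), None)
    if nth is None:
        return text          # fewer than n delimiters: keep everything
    return text[:nth.start()]


data = "ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\n"
print(before_nth(data, "//", 2))
```

Unlike a plain split, this never builds a list of all the pieces.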

Upvotes: 0

sheh

Reputation: 1023

For instance:

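delimiter = "//"  # example value
max_split = 3     # example value: stop at this occurrence of the delimiter
i = 0
s = ""
with open("...") as fd:
    for l in fd:
        if l.rstrip("\n") == delimiter:  # compare without the trailing '\n'
            i += 1
        if i >= max_split:
            break
        s += l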

Upvotes: 0

Headhunter Xamd

Reputation: 606

The method that comes to mind when I read your question uses a loop in which you cut off only part of the string (for example the 100 you mentioned), split that substring, and iterate through the result.

thestring = ""  # your string
steps = 100     # number of characters added to the slice on each pass
n = 3           # example value: how many "//" delimiters to stop at
log = 0
while True:
    substring = thestring[:log+steps]  # this is the string you will split and inspect
    thelist = substring.split("//")
    if len(thelist) > n or log + steps >= len(thestring):
        # n delimiters found (or the string ran out): keep what precedes the nth
        result = "//".join(thelist[:n])
        break
    log = log + steps  # and go again from the start, only with this offset

This way you can work through the whole 2 million(!) line string without ever splitting all of it at once.

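The same idea can also be written as a recursive function (if that is what you want), but keep in mind that Python limits recursion depth (1000 frames by default), so with steps = 100 a very long string can exhaust it: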

thestring = ""  # your string
steps = 100     # number of characters added to the slice on each call
n = 3           # example value: how many "//" delimiters to stop at

def iterateThroughHugeString(beginning):
    substring = thestring[:beginning+steps]  # this is the string you will split and inspect
    thelist = substring.split("//")
    if len(thelist) > n or beginning + steps >= len(thestring):
        # n delimiters found (or the string ran out): keep what precedes the nth
        return "//".join(thelist[:n])
    # and go again from the start, only with this offset
    return iterateThroughHugeString(beginning+steps)
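If the whole string does fit in memory, a plain str.find loop is another option, since it neither re-splits the string nor recurses. This is only a sketch under that assumption, and the function name prefix_before_nth is hypothetical:

```python
def prefix_before_nth(text, delimiter, n):
    """Return everything in text before the nth occurrence of delimiter."""
    if n < 1:
        raise ValueError("n must be at least 1")
    start = 0
    pos = -1
    for _ in range(n):
        # Resume the search just past the previous delimiter.
        pos = text.find(delimiter, start)
        if pos == -1:
            return text      # fewer than n delimiters: keep everything
        start = pos + len(delimiter)
    return text[:pos]
```

Each call to find continues where the last one stopped, so the string is scanned once and only as far as the nth delimiter.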

Upvotes: 0

Kasravnd

Reputation: 107347

As a more efficient way, you can read just the first N lines separated by your delimiter. If you are sure that every data line is followed by a delimiter line, you can use itertools.islice to do the job:

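from itertools import islice
N = 3  # example value: number of data lines to keep
with open('filename') as f:
    # islice is lazy, so consume it while the file is still open
    lines = list(islice(f, 2*N - 1))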
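To see where 2*N-1 comes from: with one data line followed by one delimiter line, the first N data lines plus the N-1 delimiters between them make 2*N-1 lines. A small check, using io.StringIO as a stand-in for the real file (an assumption made only so the snippet runs on its own):

```python
from io import StringIO
from itertools import islice

# StringIO behaves like an open text file for line-by-line iteration.
sample = StringIO("ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\nRSTLN\n//\n")
N = 3
# Take 2*N-1 lines: N data lines and the N-1 delimiter lines between them.
result = "".join(islice(sample, 2 * N - 1))
print(result)
```

Only the lines that are needed get read; the rest of the file is never touched.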

Upvotes: 0
