user1355603

Reputation: 305

How to efficiently search for strings in a file?

I have 10GB of data in the following form:

A=good
B=c++

Now I wish to find the B values that also appear elsewhere in the data. For example, I wish to find "c++" here. The approach I am following is to pick the B part (i.e. the second line first) and from there scan the rest of the file for a string equal to B's string. Then, in the second round of the loop, I take the next value of B (now the 4th line) and from there search for an equal string, and so on.
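
Roughly, my current code looks like this (the file name is just an example):

# Naive approach: for every B value, rescan the rest of the
# file looking for an equal string.
with open("data.txt") as f:
    lines = [line.strip().split("=", 1) for line in f]

for i, (key, value) in enumerate(lines):
    if key != "B":
        continue
    for other_key, other_value in lines[i + 1:]:
        if other_value == value:
            print(value)
            break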

However, the above approach takes a lot of time. Is there some other approach in Python to solve this problem efficiently?

Upvotes: 0

Views: 203

Answers (4)

thavan

Reputation: 2429

Since this data is too big, I would suggest storing it in a database like MySQL. Then your problem is solved with a single query:

select * from t1,t2 where t1.a=t2.b;

This is an alternative suggestion. If you choose to go this route, the MySQLdb module can help you connect Python to MySQL.
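
A minimal sketch of running that query from Python, assuming the A values have been loaded into table t1 (column a) and the B values into table t2 (column b); the connection details are placeholders:

import MySQLdb

# Placeholder credentials; adjust for your server.
conn = MySQLdb.connect(host="localhost", user="user",
                       passwd="password", db="mydb")
cur = conn.cursor()

# Values that appear both as an A and as a B.
cur.execute("SELECT DISTINCT t1.a FROM t1, t2 WHERE t1.a = t2.b")
for (value,) in cur.fetchall():
    print(value)

cur.close()
conn.close()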

Upvotes: 1

Running Wild

Reputation: 3011

Run this:

awk 'BEGIN {FS = "="} { print $2 "***" $1 }' huge_file | sort | awk 'BEGIN {FS = "\\*\\*\\*"} { if ($1 == prev_val && prev_key == "A" && $2 == "B") { print $1 } prev_val = $1; prev_key = $2 }'

This rewrites each line as value***key, sorts so that equal values become adjacent, and prints a value whenever a B entry directly follows an A entry with the same value. It makes the assumption that none of the strings contain the substring "*", but you could replace that with any other separator you know won't show up.

Upvotes: 0

Gareth Latty

Reputation: 88997

The best way of doing this is to read the data in, constructing a set of A items and a set of B items. Then you simply find the intersection between the two.

The only potential downside is that you need to fit all of the data into memory at once. Given your large dataset, this could be a problem. If you can handle half of it, you can build the set of A items first, then work through the B items, checking each against that set.

Example:

Using the input data:

A=good
B=c++
A=df
B=kj
A=c++
B=programming language

The first method can be done simply like so:

a = set()
b = set()
with open("test") as data:
    for line in data:
        # Strip the "A="/"B=" prefix and the trailing newline.
        line_data = line[2:].strip()
        if line.startswith("A"):
            a.add(line_data)
        else:
            b.add(line_data)

# Set intersection: values that occur as both an A and a B.
print(a & b)

Giving us:

{'c++'}

The second method can be done like so:

with open("test") as data:
    a = {line[2:].strip() for line in data if line.startswith("A")}

with open("test") as data:
    results = {item for item in (line[2:].strip() for line in data if line.startswith("B")) if item in a}

print(results)

This gives the same results while only storing half of the data in memory (or less, if there is significant duplication), and it is still very efficient thanks to the constant-time nature of set lookups.

Upvotes: 2

Danica

Reputation: 28846

Since your file is too big to easily fit in memory, how about:

  1. Split into two files, As and Bs
  2. Sort each (e.g. with unix sort or a Python external-memory mergesort)
  3. Do the merge step of mergesort to find duplicates (sketched below)
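
For step 3, here is a minimal sketch of the merge step, assuming each file holds one value per line (prefixes stripped) and is already sorted; the file names are placeholders:

def sorted_intersection(path_a, path_b):
    # Walk both sorted files in lockstep, advancing whichever
    # side holds the smaller value; equal values are in both.
    with open(path_a) as fa, open(path_b) as fb:
        line_a, line_b = fa.readline(), fb.readline()
        while line_a and line_b:
            va, vb = line_a.strip(), line_b.strip()
            if va < vb:
                line_a = fa.readline()
            elif va > vb:
                line_b = fb.readline()
            else:
                yield va  # note: in-file duplicates may repeat
                line_a, line_b = fa.readline(), fb.readline()

for value in sorted_intersection("a_sorted.txt", "b_sorted.txt"):
    print(value)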

Upvotes: 8
