Reputation: 305
I have 10GB data in the following form:
A=good
B=c++
Now I wish to find out B's . For example I wish to find out "c++", since for this case ...the approach I am following for this problem is to pick the B part (i.e. the second line first) and from there on find out a string which is equal to B's string. Then in second round of loop..I am looking for another value of B (now the 4th line) and from there find a B which has an equal string....and so on
However, the above approach takes a lot of time, is there some other approach in Python to solve this problem efficiently.
Upvotes: 0
Views: 203
Reputation: 2429
Since this data is too big, I would suggest to store it in database like mysql. Then your problem is solved with a single line of query.
select * from t1,t2 where t1.a=t2.b;
this is an alternative suggestion. If u choose to go, mysqldb module can help you to connect python and mysql.
Upvotes: 1
Reputation: 3011
Run this:
cat huge_file | awk 'BEGIN {FS = "="} { print $2 "***" $1 }' | sort -n | awk 'BEGIN {FS = "\\*\\*\\*"} { if (prev == $1 && $2 == "B") { print $1 } prev = $1 }'
This splits them into A/B and value, sorts by value, and finds adjacent pairs. It makes the assumption that none of the strings have the substring "*", but you could replace that with any other substring you know won't show up.
Upvotes: 0
Reputation: 88997
The best way of doing this is to read the data in, constructing a set of A
items, and a set of B
items. Then you simply find the intersection between the two.
The only potential downside is you need to fit all of the data into memory at once. Given your large dataset, this could be a problem. If you could handle half, then you could create your set of A
items, then work through the B
items checking against the set.
Example:
Using the input data:
A=good
B=c++
A=df
B=kj
A=c++
B=programming language
The first method can be done simply like so:
a = set()
b = set()
with open("test") as data:
for line in data:
line_data = line[2:].strip()
if line.startswith("A"):
a.add(line_data)
else:
b.add(line_data)
print(a & b)
Giving us:
{'c++'}
The second method can be done like so:
with open("test") as data:
a = {line[2:].strip() for line in data if line.startswith("A")}
with open("test") as data:
results = {item for item in (line[2:].strip() for line in data if line.startswith("B")) if item in a}
print(results)
This gives the same results, while only involving storing half of the data in memory (or less if there is significant duplication of data), and is still far more efficient due to the efficient nature of set lookups.
Upvotes: 2
Reputation: 28846
Since your file is too big to easily fit in memory, how about:
sort
or a Python external-memory mergesort)Upvotes: 8