Reputation: 626
I have an input file1
of two columns(tab-separted):
c1\tc2
aaa\t232 65 19 32
bbew\t32 22 20
jhsi\t986 1 32 463 221
And input file2
which have one column:
c1
19
1
32
277
what I want is to search an element from file2
in file1
, and return the corresponding value in c1
. If there is more than one matched value, then return all together in one column.
Here is what should the output file be like:
19 aaa
1 jhsi
32 aaa bbew jhsi
277
277
would be left empty because it does not exist.
Any suggestion will be helpful.
Upvotes: 1
Views: 2838
Reputation: 164843
This isn't easily vectorisable. For performance, I suggest you perform your transformation before you put data in a Pandas dataframe. Here is a solution using collections.defaultdict
:
# use set for O(1) lookup
scope_set = set(df2['c1'])
# initialise defualtdict of lists
dd = defaultdict(list)
# iterate and create dictionary mapping numbers to keys
for row in df1.itertuples(index=False):
for num in map(int, row.c2.split()):
if num in scope_set:
dd[num].append(row.c1)
# construct dataframe from defaultdict
df = pd.DataFrame({'num': list(dd), 'keys': list(map(' '.join, dd.values()))})
# reindex to include blanks
df = df.set_index('num').reindex(sorted(scope_set)).reset_index()
Result
print(df)
num keys
0 1 jhsi
1 19 aaa
2 32 aaa bbew jhsi
3 277 NaN
Setup
from io import StringIO
from collections import defaultdict
file1 = StringIO("""c1\tc2
aaa\t232 65 19 32
bbew\t32 22 20
jhsi\t986 1 32 463 221""")
file2 = StringIO("""c1
19
1
32
277""")
df1 = pd.read_csv(file1, sep='\t')
df2 = pd.read_csv(file2)
Upvotes: 2