Reputation: 738
I am trying to implement an efficient bitwise majority function across the columns of a dataframe.
To keep things simple, the example below shows a single row A, transposed, so that the columns 0, 1, 2, 3 appear as rows.
         A
      +-----+
0     | 000 |
      +-----+
1     | 111 |
      +-----+
2     | 001 |
      +-----+
3     | 001 |
      +-----+

      +-----+
Output| 001 |
      +-----+
The calculation is done by finding the most repeated bit value in each position. For example, the LSB values are [0,1,1,1] so the returned LSB is 1. Similarly the other two bits are calculated to be 0 and 0.
What is the best way to compute this majority function? Does the method to calculate the majority differ if the values are stored as integers?
Upvotes: 0
Views: 488
Reputation: 26
Second edit: It is actually easier not to split the digits into a list, but to access the i-th character of each string via Series.str.get() (here df is the example DataFrame defined in the old answer below):
df.T.apply(lambda row: ''.join([str(int(row.str.get(i).astype(int).mean() >= 0.5)) for i in range(3)]))
If you have your numbers as integers instead of strings, you just have to replace the method to extract the i-th digit:
n_digits = 3
df.T.apply(lambda row: ''.join([str(int(((row // 2**i) % 2).mean() >= 0.5)) for i in range(n_digits-1, -1, -1)]))
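Since the question asks about efficiency, here is a minimal sketch of how the integer case could be vectorized with NumPy instead of apply. The function name majority_bits and the frame df_int are hypothetical names chosen for illustration, and the sketch assumes non-negative integers that fit in n_bits bits. As in the code above, ties count as 1.

```python
import numpy as np
import pandas as pd

def majority_bits(df: pd.DataFrame, n_bits: int) -> pd.Series:
    """Bitwise majority across the columns of each row (ties count as 1)."""
    values = df.to_numpy(dtype=np.int64)        # shape (n_rows, n_cols)
    shifts = np.arange(n_bits - 1, -1, -1)      # most significant bit first
    # bits[r, c, k] is the k-th bit (MSB first) of the value in row r, column c
    bits = (values[:, :, None] >> shifts) & 1
    ones = bits.sum(axis=1)                     # number of 1s per row and bit position
    majority = (2 * ones >= values.shape[1]).astype(np.int64)
    # Reassemble the majority bits into one integer per row
    weights = 1 << shifts
    return pd.Series(majority @ weights, index=df.index)

# Same data as the string example, stored as integers
df_int = pd.DataFrame([[0b000, 0b111, 0b001, 0b001],
                       [0b111, 0b111, 0b101, 0b001]],
                      columns=['0', '1', '2', '3'], index=['A', 'B'])
result = majority_bits(df_int, n_bits=3)
print(result.map(lambda v: format(v, '03b')))
```

Comparing 2 * ones with the number of columns avoids floating-point means, and the result stays an integer until it is formatted for display.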
Old answer: Convert each entry to a list of integer digits, check for each digit position whether the mean is at least 0.5, and join the resulting Boolean values back into a string of zeros and ones.
import pandas as pd

df = pd.DataFrame([['000','111','001','001'],['111','111','101','001']], columns=['0','1','2','3'], index=['A','B'])
# For each row of df: split the strings into digits, take the majority per position, join back to a string
(df.T.apply(lambda row:
(row.apply(lambda x: pd.Series(list(x))).astype(int).mean() >= 0.5)
.astype(int))
.astype(str)
.apply(lambda x: ''.join(x)))
Edit: Let's have a closer look at the code from the inside out: The variable x is the binary representation of a number as a string. It first gets transformed to a list of single characters, then to a Series of single characters, and then to a Series of integers:
x = '001'
print(list(x))
print(pd.Series(list(x)))
print(pd.Series(list(x)).astype(int))
>>>
['0', '0', '1']
0 0
1 0
2 1
dtype: object
0 0
1 0
2 1
dtype: int32
We use this transformation for a whole row (which is a column of df.T; remember that apply works on columns by default):
row = df.loc['A']
print(row.apply(lambda x: pd.Series(list(x))).astype(int))
>>>
0 1 2
0 0 0 0
1 1 1 1
2 0 0 1
3 0 0 1
Next comes the majority function: The i-th digit should be 1 if at least 50% of the entries of a column are 1. We can check this by computing the mean of the i-th column and comparing it to 0.5:
print(df.T.apply(lambda row: row.apply(lambda x: pd.Series(list(x))).astype(int).mean() >=0.5))
>>>
A B
0 False True
1 False True
2 True True
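Note that with an even number of columns a tie is possible, and the >= 0.5 comparison resolves it to 1. That is what happens for the middle bit of row B above, where two of the four values are 1. The following small sketch contrasts this with a strict majority. The variable names are made up for illustration.

```python
import pandas as pd

# Middle bits of row B from the example: '111', '111', '101', '001' -> 1, 1, 0, 0
middle_bits = pd.Series([1, 1, 0, 0])

tie_as_one = int(middle_bits.mean() >= 0.5)   # rule used in this answer
tie_as_zero = int(middle_bits.mean() > 0.5)   # strict-majority alternative

print(tie_as_one, tie_as_zero)
```

Which of the two is right depends on how ties should be treated in your data. With an odd number of columns both comparisons give the same result.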
The rest of the code converts each column, which is basically a list of Boolean values, back to a list of integers, then to a list of strings, and finally to a single string, so [False, False, True] becomes [0, 0, 1], which becomes ['0', '0', '1'], which is joined to '001'.
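Putting the steps together, the whole pipeline could be wrapped in a small helper along the following lines. This is only a sketch: the name majority_string is hypothetical, it assumes all strings in a row have the same length, and it uses df.apply(..., axis=1) to go row by row instead of transposing first.

```python
import pandas as pd

def majority_string(row: pd.Series) -> str:
    """Return the bitwise majority of equal-length bit strings (ties give '1')."""
    # One row per entry, one column per digit position
    digits = row.apply(lambda x: pd.Series(list(x))).astype(int)
    # Majority per digit position, converted from Boolean to '0'/'1'
    majority = (digits.mean() >= 0.5).astype(int).astype(str)
    return ''.join(majority)

df = pd.DataFrame([['000', '111', '001', '001'],
                   ['111', '111', '101', '001']],
                  columns=['0', '1', '2', '3'], index=['A', 'B'])
result = df.apply(majority_string, axis=1)
print(result)
```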
Upvotes: 1