Reputation: 137
I have a few CSV files, each containing a pair of bearings for many locations. I am trying to expand the values to include every number between the bearing pair for each location and export the variable-length results as a CSV in the same format.
Example:
df = pd.read_csv('bearing.csv')
Data structure:
A B C D E
0 0 94 70 67 84
1 120 132 109 152 150
The ideal result is a variable-length multidimensional array:
A B C D E
0 0 94 70 67 84
1 1 95 71 68 85
2 3 96 72 69 86
...
n 120 132 109 152 150
I am looping through each column and getting the range of the pair of values, but I am struggling when trying to overwrite the old column with the new range of values.
for col in bear:
    min_val = min(bear[col])
    max_val = max(bear[col])
    range_vals = range(min_val, max_val + 1)
    bear[col] = range_vals
I am getting the following error:
ValueError: Length of values does not match length of index
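The error comes from assigning a sequence whose length differs from the DataFrame's index: the frame still has its original number of rows, but the expanded range has many more values. A minimal sketch (with hypothetical two-row data) that reproduces it:

```python
import pandas as pd

# Hypothetical stand-in for one column of the bearing data: two rows.
bear = pd.DataFrame({'A': [0, 120]})

try:
    # 121 values cannot be assigned in place to a 2-row column,
    # so pandas raises a ValueError about mismatched lengths.
    bear['A'] = list(range(bear['A'].min(), bear['A'].max() + 1))
except ValueError as e:
    print(e)
```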
Upvotes: 3
Views: 3116
Reputation: 862481
You can use a dict comprehension with min and max in the DataFrame constructor, but you get a lot of NaN values at the end of the columns:
df = pd.DataFrame({col: pd.Series(range(df[col].min(),
                                        df[col].max() + 1))
                   for col in df.columns})
print (df)
A B C D E
0 0 94.0 70.0 67.0 84.0
1 1 95.0 71.0 68.0 85.0
2 2 96.0 72.0 69.0 86.0
3 3 97.0 73.0 70.0 87.0
4 4 98.0 74.0 71.0 88.0
5 5 99.0 75.0 72.0 89.0
6 6 100.0 76.0 73.0 90.0
7 7 101.0 77.0 74.0 91.0
8 8 102.0 78.0 75.0 92.0
9 9 103.0 79.0 76.0 93.0
10 10 104.0 80.0 77.0 94.0
11 11 105.0 81.0 78.0 95.0
12 12 106.0 82.0 79.0 96.0
13 13 107.0 83.0 80.0 97.0
14 14 108.0 84.0 81.0 98.0
15 15 109.0 85.0 82.0 99.0
16 16 110.0 86.0 83.0 100.0
17 17 111.0 87.0 84.0 101.0
18 18 112.0 88.0 85.0 102.0
19 19 113.0 89.0 86.0 103.0
20 20 114.0 90.0 87.0 104.0
21 21 115.0 91.0 88.0 105.0
22 22 116.0 92.0 89.0 106.0
23 23 117.0 93.0 90.0 107.0
24 24 118.0 94.0 91.0 108.0
25 25 119.0 95.0 92.0 109.0
26 26 120.0 96.0 93.0 110.0
27 27 121.0 97.0 94.0 111.0
28 28 122.0 98.0 95.0 112.0
29 29 123.0 99.0 96.0 113.0
.. ... ... ... ... ...
91 91 NaN NaN NaN NaN
92 92 NaN NaN NaN NaN
93 93 NaN NaN NaN NaN
94 94 NaN NaN NaN NaN
95 95 NaN NaN NaN NaN
96 96 NaN NaN NaN NaN
97 97 NaN NaN NaN NaN
98 98 NaN NaN NaN NaN
99 99 NaN NaN NaN NaN
100 100 NaN NaN NaN NaN
101 101 NaN NaN NaN NaN
102 102 NaN NaN NaN NaN
103 103 NaN NaN NaN NaN
104 104 NaN NaN NaN NaN
105 105 NaN NaN NaN NaN
106 106 NaN NaN NaN NaN
107 107 NaN NaN NaN NaN
108 108 NaN NaN NaN NaN
109 109 NaN NaN NaN NaN
110 110 NaN NaN NaN NaN
111 111 NaN NaN NaN NaN
112 112 NaN NaN NaN NaN
113 113 NaN NaN NaN NaN
114 114 NaN NaN NaN NaN
115 115 NaN NaN NaN NaN
116 116 NaN NaN NaN NaN
117 117 NaN NaN NaN NaN
118 118 NaN NaN NaN NaN
119 119 NaN NaN NaN NaN
120 120 NaN NaN NaN NaN
If you have only a few columns, it is possible to use:
df = pd.DataFrame({'A': pd.Series(range(df.A.min(), df.A.max() + 1)),
                   'B': pd.Series(range(df.B.min(), df.B.max() + 1))})
EDIT:
If the min value is in the first row and the max in the last, you can use iloc:
df = pd.DataFrame({col: pd.Series(range(df[col].iloc[0],
                                        df[col].iloc[-1] + 1))
                   for col in df.columns})
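A quick check, using a hypothetical two-row sample in the shape of the question's data: when each column's first row holds the min and its last row the max, the iloc version produces the same frame as the min/max version.

```python
import pandas as pd

# Hypothetical sample: first row is the min, last row is the max.
df = pd.DataFrame({'A': [0, 120], 'B': [94, 132]})

via_minmax = pd.DataFrame({col: pd.Series(range(df[col].min(),
                                                df[col].max() + 1))
                           for col in df.columns})
via_iloc = pd.DataFrame({col: pd.Series(range(df[col].iloc[0],
                                              df[col].iloc[-1] + 1))
                         for col in df.columns})

print(via_minmax.equals(via_iloc))  # True
```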
Timings:
In [3]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].iloc[0], df[col].iloc[-1] + 1)) for col in df.columns }) )
1000 loops, best of 3: 1.75 ms per loop
In [4]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].min(), df[col].max() + 1)) for col in df.columns }) )
The slowest run took 5.50 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.18 ms per loop
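Since the question also asks about exporting the result back to CSV, one possible sketch (the file name is hypothetical, and the two-row frame stands in for pd.read_csv('bearing.csv')): to_csv writes NaN cells as empty fields by default, so the shorter columns simply end early in the output file.

```python
import pandas as pd

# Hypothetical stand-in for the data read from bearing.csv.
df = pd.DataFrame({'A': [0, 120], 'B': [94, 132]})

# Expand each column to its full min..max range (the answer's approach).
out = pd.DataFrame({col: pd.Series(range(df[col].min(),
                                         df[col].max() + 1))
                    for col in df.columns})

# NaN cells become empty CSV fields, giving variable-length columns.
out.to_csv('bearing_expanded.csv', index=False)
```

If empty fields are not wanted, to_csv's na_rep parameter can substitute another placeholder string.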
Upvotes: 3