Reputation: 1678

Remove empty rows and empty [ ] using Python

I have 10,000 rows in my CSV file. I want to remove the empty brackets [] and drop the rows that contain only [[]], as depicted in the following picture:

For instance the first cell in the first column :

[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

needs to be transformed into:

[['1', 2364, 2382, 1552, 1585],['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

and the row with only empty bracket:

[[]]    [[]]

needs to be removed from the file. As a result we get:

I tried:

df1 = df.Column_1.str.strip('[]').str.split(',', expand=True)

My data are of type str:

print(type(df.loc[0,'Column_1']))
<class 'str'>

print(type(df.loc[0,'Column_2']))
<class 'str'>

EDIT1 After executing the following code:

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)

df1 = df1.dropna()

This solves the problem. However, I have an issue with the comma ',' when it appears in the resulting line as a character rather than as a delimiter. I want to create a new CSV file with the following columns:

columns =['char', 'left', 'right', 'top', 'down']

which corresponds for instance to:

'1' 2364 2382 1552 1585

to get a CSV file as follows:

   char  left  top  right  bottom
0   'm'    38  104   2456    2492
1   'i'    40  102   2442     222
2   '.'   203  213    191     198
3   '3'   235  262    131    3333
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444

so the whole code to get this is:

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)

df1 = df1.dropna()

cols = ['char','left','right','top','bottom']

df1 = df.positionlrtb.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace(['\[','\]'], ['',''], regex=True)
df1['left']=df1['left'].replace(['\[','\]'], ['',''], regex=True)
df1['top']=df1['top'].replace(['\[','\]'], ['',''], regex=True)
df1['right']=df1['right'].replace(['\[','\]'], ['',''], regex=True)
df1['bottom']=df1['bottom'].replace(['\[','\]'], ['',''], regex=True)

However, after doing that I can't find any ',' character in my file, and the columns in the new CSV file are shifted. Instead of getting:

',' 1491    1494    172 181

I get no comma ','. The misalignment can be seen in the following two lines:

 '    '     1491    1494    172
181  'r'    1508    1517    159

it should be:

',' 1491 1494 172 181
'r' 1508 1517 159 ... and so on

EDIT2

I'm trying to add two more columns, called line_number and all_chars_in_same_row.

line_number is the line from which a record is extracted. For example,

'm' 38 104 2456 2492

is extracted from, say, line 2.

all_chars_in_same_row contains all the characters (space-separated) that are in the same row. For instance, for

character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]

I get '1' '8' '4' '1' '7' and so on.

More formally, all_chars_in_same_row lists every character of the row given in the line_number column:

   char  left  top  right  bottom   line_number  all_chars_in_same_row
0   'm'    38  104   2456    2492   from line 2  'm' '2' '5' 'g'
1   'i'    40  102   2442     222   from line 4
2   '.'   203  213    191     198   from line 6
3   '3'   235  262    131    3333  
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444

The code related to that is:

import pandas as pd

df_data = pd.read_csv('see2.csv', header=None, usecols=[1], names=['character_position'])
# the column was named 'character_position' in read_csv, so use that name here
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

x = len(df_data.columns)  # get total number of columns
# get all characters from every 5th column, concatenate and create new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda row: ','.join(row.dropna()), axis=1)
# get index of each row. This is the line number for your record
df_data[x+1] = df_data.index.get_level_values(0)
# now set line number and character columns as the index of the data frame
df_data.set_index([x+1, x], inplace=True, drop=True)

df_data.columns = [df_data.columns % 5, df_data.columns // 5]

df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1)  # assign character values to a column
cols = ['char', 'left', 'top', 'right', 'bottom', 'FromLine', 'all_chars_in_same_row']
df_data.columns = cols
df_data.reset_index(inplace=True)  # remove multi-indexing
print(df_data[cols])

and output 

         char  left   top right bottom  from line all_chars_in_same_row
    0     '.'   203   213   191    198          0  ['.', '3', 'C']
    1     '3'  1758  1775   370    391          0  ['.', '3', 'C']
    2     'C'   296   305  1492   1516          0  ['.', '3', 'C']
    3     'A'   275   347   147    239          1  ['A', 'M', 'D']
    4     'M'  2166  2184   370    391          1  ['A', 'M', 'D']
    5     'D'   339   362  1815   1840          1  ['A', 'M', 'D']
    6     'A'    73    91   373    394          2  ['A', 'D', 'A']
    7     'D'  1395  1415   427    454          2  ['A', 'D', 'A']
    8     'A'  1440  1455  2047   2073          2  ['A', 'D', 'A']
    9     'D'   454   473   663    685          3  ['D', 'O', '0']
    10    'O'  1533  1545   487    541          3  ['D', 'O', '0']
    11    '0'   339   360  2137   2163          3  ['D', 'O', '0']
    12    'A'   108   129   727    751          4  ['A', 'V', 'I']
    13    'V'  1659  1677   490    514          4  ['A', 'V', 'I']
    14    'I'   339   360  1860   1885          4  ['A', 'V', 'I']
    15    'N'    34    51   949    970          5  ['N', '/', '2']
    16    '/'  1890  1904   486    505          5  ['N', '/', '2']
    17    '2'  1266  1283  1951   1977          5  ['N', '/', '2']
    18    'S'  1368  1401    43     85          6  ['S', 'A', '8']
    19    'A'  1344  1361   583    607          6  ['S', 'A', '8']
    20    '8'  2207  2217  1492   1515          6  ['S', 'A', '8']
    21    'S'  1437  1457   112    138          7  ['S', 'o', 'O']
    22    'o'  1548  1580   979   1015          7  ['S', 'o', 'O']
    23    'O'  1331  1349   370    391          7  ['S', 'o', 'O']
    24    'h'  1686  1703   315    339          8  ['h', 't', 't']
    25    't'   169   190  1291   1312          8  ['h', 't', 't']
    26    't'   169   190  1291   1312          8  ['h', 't', 't']
    27    'N'  1331  1349   370    391          9  ['N', 'C', 'C']
    28    'C'   296   305  1492   1516          9  ['N', 'C', 'C']
    29    'C'   296   305  1492   1516          9  ['N', 'C', 'C']

However, I get strange results (the order of letters, numbers, columns, headers, and so on). I can't share the file because it is too long; I tried, but it exceeds the maximum number of characters.

This line of code

df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

returns None values:

  0      1      2      3      4     5      6      7      8      9     ...   \
0  'm'     38    104   2456   2492   'i'     40    102   2442   2448  ...    
1  '.'    203    213    191    198   '3'    235    262    131    198  ...    
2  'A'    275    347    147    239   'M'    363    465    145    239  ...    
3  'A'     73     91    373    394   'D'     93    112    373    396  ...    
4  'D'    454    473    663    685   'O'    474    495    664    687  ...    
5  'A'    108    129    727    751   'V'    129    150    727    753  ...    
6  'N'     34     51    949    970   '/'     52     61    948    970  ...    
7  'S'   1368   1401     43     85   'A'   1406   1446     43     85  ...    
8  'S'   1437   1457    112    138   'o'   1458   1476    118    138  ...    
9  'h'   1686   1703    315    339   't'   1706   1715    316    339  ...    
   1821  1822  1823  1824  1825  1826  1827  1828  1829  1830  
0  None  None  None  None  None  None  None  None  None  None  
1  None  None  None  None  None  None  None  None  None  None  
2  None  None  None  None  None  None  None  None  None  None  
3  None  None  None  None  None  None  None  None  None  None  
4  None  None  None  None  None  None  None  None  None  None  
5  None  None  None  None  None  None  None  None  None  None  
6  None  None  None  None  None  None  None  None  None  None

EDIT3 However, when I add page_number along with character_position:

df1 = pd.DataFrame({
        "from_line": np.repeat(df.index.values, df.character_position.str.len()),
        "b": list(chain.from_iterable(df.character_position)),
        "page_number" : np.repeat(df.index.values,df['page_number'])
})

I got the following error:

 File "/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py", line 47, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
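
A likely cause, sketched under assumptions: the second argument of np.repeat is the number of repetitions and must be integer-like, while df['page_number'] holds strings such as '1841729699_001', so NumPy cannot cast it to int64. Repeating the page_number values by the same per-row counts used for from_line avoids that. The two-row frame below is made up for illustration; only the column names come from the question.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical data: two source lines, already parsed into lists of groups.
df = pd.DataFrame({
    "page_number": ["1841729699_001", "1841729699_001"],
    "character_position": [
        [["m", 38, 104, 2456, 2492, "i", 40, 102, 2442, 2448]],
        [[".", 203, 213, 191, 198], ["3", 235, 262, 131, 198]],
    ],
})

# Number of groups per source line: these integers are the repeat counts.
counts = df.character_position.str.len().values

df1 = pd.DataFrame({
    "from_line": np.repeat(df.index.values, counts),
    # Repeat the page_number VALUES, using the integer counts as repeats.
    "page_number": np.repeat(df.page_number.values, counts),
    "b": list(chain.from_iterable(df.character_position)),
})
print(df1)
```

All three columns then have one entry per group, so the DataFrame constructor receives arrays of equal length.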

Upvotes: 1

Answers (4)

jezrael

Reputation: 862641

For lists, you can use applymap with a list comprehension to remove the [] entries first, and then remove rows with boolean indexing, where the mask checks that no value in the row has length 0 (an empty list).

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

If you need to remove a row when any value is [[]]:

df1 = df1[~(df1.applymap(len).eq(0)).any(axis=1)]

If values are strings:

df1 = df.replace([r'\[\],', r'\[\[\]\]', ''], ['', '', np.nan], regex=True)

and then dropna:

df1 = df1.dropna(how='all')

Or:

df1 = df1.dropna()
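
The list-based variant above can be tried on a tiny frame. This is only a sketch: the two-row data and the helper name drop_empty are made up for illustration, and it uses apply with Series.map instead of applymap, because applymap was renamed to DataFrame.map in pandas 2.1 and the column-wise form behaves the same in older and newer versions.

```python
import pandas as pd

# Hypothetical frame whose cells are already Python lists (not strings).
df = pd.DataFrame({
    "Column_1": [
        [["1", 2364, 2382, 1552, 1585], [], ["E", 2369, 2381, 1623, 1640]],
        [[]],
    ],
    "Column_2": [
        [["8", 2369, 2382, 1644, 1668]],
        [[]],
    ],
})


def drop_empty(cell):
    """Remove the empty inner lists from one cell."""
    return [item for item in cell if len(item) > 0]


# Step 1: clean every cell, column by column.
cleaned = df.apply(lambda col: col.map(drop_empty))

# Step 2: keep only the rows in which no cell ended up empty.
lengths = cleaned.apply(lambda col: col.map(len))
result = cleaned[lengths.ne(0).all(axis=1)]
print(result)
```

The second row, which held only [[]] in both columns, is dropped; the first row keeps its non-empty inner lists.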

EDIT1:

import ast
from itertools import chain

import numpy as np
import pandas as pd

df = pd.read_csv('see2.csv', index_col=0)

df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)

df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
      page_number                                       positionlrtb  \
0  1841729699_001  [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...   
1  1841729699_001   [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]   
2  1841729699_001  [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...   
3  1841729699_001  [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...   
4  1841729699_001  [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...   

                    LineIndex  
0      [[mi, il, mu, il, il]]  
1                      [[.3]]  
2                   [[amsun]]  
3  [[adresse, de, livraison]]  
4                [[document]]

cols = ['char','left','top','right','bottom']

df1 = pd.DataFrame({
        "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
        "b": list(chain.from_iterable(df.positionlrtb))})

df1 = pd.DataFrame(df1.b.values.tolist())    
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)  
cols = ['char','left','top','right','bottom']
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)

print (df1)
     char  left   top  right  bottom
0       m    38   104   2456    2492
1       i    40   102   2442    2448
2       i    40   100   2402    2410
3       l    40   102   2372    2382
4       m    40   102   2312    2358
5       u    40   102   2292    2310
6       i    40   104   2210    2260
7       l    40   104   2180    2208
8       i    40   104   2140    2166
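
Parsing each cell with ast.literal_eval, as above, is also what keeps a ',' character intact: splitting the raw string on ',' cannot tell the quoted comma from the delimiters, so every later field shifts by one. A minimal sketch, where the cell string is a made-up example assembled from values that appear in the question:

```python
import ast

# Hypothetical cell: the first character is a literal comma.
cell = "[[',', 1491, 1494, 172, 181, '1', 2364, 2382, 1552, 1585], []]"

# Naive approach: the quoted comma is split like any other delimiter.
naive = cell.strip("[]").split(",")
print(naive[:3])  # the ',' character is gone and the fields are shifted

# Parsing the literal keeps ',' as an ordinary one-character string.
groups = [group for group in ast.literal_eval(cell) if group]

# Cut each flat group into records of five: char, then four coordinates.
records = [group[i:i + 5] for group in groups for i in range(0, len(group), 5)]
print(records)
```

ast.literal_eval only evaluates Python literals, so it is a reasonably safe way to read list-like strings from a file.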

EDIT2:

#skip first row
df = pd.read_csv('see2.csv', usecols=[2], names=['character_position'], skiprows=1)
print (df.head())
                                  character_position
0  [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1  [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2  [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3  [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4  [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...

#convert to list, remove empty lists
df.character_position = df.character_position.apply(ast.literal_eval)
df.character_position = df.character_position.apply(lambda x: [y for y in x if len(y) > 0])

#new df - http://stackoverflow.com/a/42788093/2901002
df1 = pd.DataFrame({
        "from line": np.repeat(df.index.values, df.character_position.str.len()),
        "b": list(chain.from_iterable(df.character_position))})

#keep only the strings via list comprehension; convert to tuple because the values are used to create an index
df1['all_chars_in_same_row'] = df1['b'].apply(lambda x: tuple([y for y in x if isinstance(y, str)]))
df1 = df1.set_index(['from line','all_chars_in_same_row'])
#new df from column b
df1 = pd.DataFrame(df1.b.values.tolist(), index=df1.index)   
#Multiindex in columns
df1.columns = [df1.columns % 5, df1.columns // 5]
#reshape
df1 = df1.stack().reset_index(level=2, drop=True)  
cols = ['char','left','top','right','bottom']
df1.columns = cols
#convert last columns to int
df1[cols[1:]] = df1[cols[1:]].astype(int)
df1 = df1.reset_index()
#convert tuples to list
df1['all_chars_in_same_row'] = df1['all_chars_in_same_row'].apply(list)

print (df1.head(15))
    from line           all_chars_in_same_row char  left  top  right  bottom
0           0  [m, i, i, l, m, u, i, l, i, l]    m    38  104   2456    2492
1           0  [m, i, i, l, m, u, i, l, i, l]    i    40  102   2442    2448
2           0  [m, i, i, l, m, u, i, l, i, l]    i    40  100   2402    2410
3           0  [m, i, i, l, m, u, i, l, i, l]    l    40  102   2372    2382
4           0  [m, i, i, l, m, u, i, l, i, l]    m    40  102   2312    2358
5           0  [m, i, i, l, m, u, i, l, i, l]    u    40  102   2292    2310
6           0  [m, i, i, l, m, u, i, l, i, l]    i    40  104   2210    2260
7           0  [m, i, i, l, m, u, i, l, i, l]    l    40  104   2180    2208
8           0  [m, i, i, l, m, u, i, l, i, l]    i    40  104   2140    2166
9           0  [m, i, i, l, m, u, i, l, i, l]    l    40  104   2124    2134
10          1                          [., 3]    .   203  213    191     198
11          1                          [., 3]    3   235  262    131     198
12          2                 [A, M, S, U, N]    A   275  347    147     239
13          2                 [A, M, S, U, N]    M   363  465    145     239
14          2                 [A, M, S, U, N]    S   485  549    145     243
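
For comparison, the same reshaping can be sketched in plain Python, which may be easier to step through when debugging. This is not the pandas method above, just an equivalent loop; the function name explode_rows is hypothetical, and it assumes every non-empty group has a length that is a multiple of 5.

```python
def explode_rows(rows):
    """Turn parsed cells into one record per character.

    rows: list of cells, each cell a list of flat groups
          [char, left, top, right, bottom, char, left, ...].
    """
    records = []
    for line_number, groups in enumerate(rows):
        for group in groups:
            if not group:  # skip empty lists
                continue
            # All characters of this group, i.e. its string entries.
            chars = [value for value in group if isinstance(value, str)]
            for i in range(0, len(group), 5):
                char, left, top, right, bottom = group[i:i + 5]
                records.append({
                    "from line": line_number,
                    "all_chars_in_same_row": chars,
                    "char": char,
                    "left": left,
                    "top": top,
                    "right": right,
                    "bottom": bottom,
                })
    return records


rows = [
    [["m", 38, 104, 2456, 2492, "i", 40, 102, 2442, 2448], []],
    [[".", 203, 213, 191, 198, "3", 235, 262, 131, 198]],
]
records = explode_rows(rows)
for record in records:
    print(record)
```

The resulting list of dicts can be passed to pd.DataFrame(records) if a DataFrame is needed afterwards.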

Upvotes: 1

Ajax1234

Reputation: 71451

You could do this:

lst = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
new_lst = [i for i in lst if len(i) > 0]

Upvotes: 0

boot-scootin

Reputation: 12515

You could use a list comprehension for this:

arr = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

new_arr = [x for x in arr if x]

Or perhaps you prefer list + filter:

new_arr = list(filter(lambda x: x, arr))

The reason lambda x: x works in this case is that this particular lambda tests whether a given x in arr is "truthy." More specifically, the lambda filters out elements of arr that are "falsey," like an empty list, []. It's almost like saying, "Keep everything in arr that 'exists'," so to speak.
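
As a small illustration of that truthiness test: passing None as the function makes filter keep exactly the truthy elements, so it should behave like the lambda x: x version.

```python
arr = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

# bool() shows which elements count as truthy; the empty list is the only falsey one.
truthiness = [bool(x) for x in arr]

# filter(None, ...) keeps the truthy elements, same as filter(lambda x: x, ...).
kept = list(filter(None, arr))

print(truthiness)
print(kept)
```

Note that this also drops other falsey values such as 0, '' or None, which is harmless here because every element of arr is a list.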

Upvotes: 1

ivan7707

Reputation: 1156

new_list = []
for x in old_list:
    if len(x) > 0:
        new_list.append(x)

Upvotes: 0
