BhishanPoudel
BhishanPoudel

Reputation: 17164

Pandas read csv with regex separator

I have been trying to read a custom csv file like this:

6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000  

I tried regex in https://regex101.com/ Following code works:

this works

# https://regex101.com/
s = "6 Rotterdam NLD Zuid-Holland 593321 "
pat = r'(\d+)\s+([\D]+)\s(\d+)\s+'

m = re.match(pat,s)
m.groups() # ('6', 'Rotterdam NLD Zuid-Holland', '593321')

I got the correct answer, but when I applied the code to pandas read_csv, somehow it failed to work.

my attempt

import numpy as np
import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000  """;

sep = r'(\d+)\s+|([\D]+)\s+|(\d+)\s+'
df = pd.read_csv(StringIO(s), sep=sep,engine='python')
df

I get a lot of Nans, how to get only 3 columns?

Column names are: ID CITY POPULATION

similar questions

Upvotes: 1

Views: 1637

Answers (2)

Lydia van Dyke
Lydia van Dyke

Reputation: 2526

Just to provide an alternative solution without the use of regular expressions:

You could also do the parsing of the text file in plain Python. In some cases this may be easier to maintain than a rather complex regular expression.

For this specific format we know that the first and the last number in every row have a special meaning. So I would have used split and rsplit to pick them.

import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000  """

data = []
for line in StringIO(s):
    line = line.strip()
    if not line:
        continue
    id_value, line = line.split(" ", 1)
    city, population = line.rsplit(" ", 1)

    data.append((id_value, city, population))

df = pd.DataFrame(data, columns=["id", "city", "population"])
df["id"] = pd.to_numeric(df["id"])
df["population"] = pd.to_numeric(df["population"])
print(df)

I did not do any speed measurements. Depending on the file size however, speed might not be an issue at all, anyhow. But even if it is: I would have used this script to preprocess the data first (and one time only) in order to be able to use regular old pd.read_csv without additional arguments needed.

Upvotes: 1

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627607

You used the pattern to match (extract) text but in the pandas method you are splitting with the pattern.

If there can be 1, 2 or 3 digits only at the start of each line, use

sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'

See the regex demo. You may extend it by adding more lookbehinds in the first non-capturing group.

Details

  • (?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+ - 1+ whitespaces (\s+) that are preceded with 1 digit (\d), or 2 digits (\d{2}), or 3 digits (\d{3}) at the start of string (^)
  • | - or
  • \s+(?=\S+\s*$) - 1+ whitespaces followed with 1+ non-whitespace chars and then any trailing 0+ whitespaces before end of string.

Upvotes: 2

Related Questions