Reputation: 17164
I have been trying to read a custom CSV file like this:
6 Rotterdam NLD Zuid-Holland 593321
19 Zaanstad NLD Noord-Holland 135621
214 Porto Alegre BRA Rio Grande do Sul 1314032
397 Lauro de Freitas BRA Bahia 109236
547 Dobric BGR Varna 100399
552 Bujumbura BDI Bujumbura 300000
554 Santiago de Chile CHL Santiago 4703954
626 al-Minya EGY al-Minya 201360
646 Santa Ana SLV Santa Ana 139389
762 Bahir Dar ETH Amhara 96140
123 Chicago 10000
222 New York 200000
I tried the regex at https://regex101.com/ and the following code works:
# https://regex101.com/
import re

s = "6 Rotterdam NLD Zuid-Holland 593321 "
pat = r'(\d+)\s+([\D]+)\s(\d+)\s+'
m = re.match(pat, s)
m.groups()  # ('6', 'Rotterdam NLD Zuid-Holland', '593321')
I got the correct answer, but when I applied the pattern in pandas read_csv, it did not work.
import numpy as np
import pandas as pd
from io import StringIO
s = """6 Rotterdam NLD Zuid-Holland 593321
19 Zaanstad NLD Noord-Holland 135621
214 Porto Alegre BRA Rio Grande do Sul 1314032
397 Lauro de Freitas BRA Bahia 109236
547 Dobric BGR Varna 100399
552 Bujumbura BDI Bujumbura 300000
554 Santiago de Chile CHL Santiago 4703954
626 al-Minya EGY al-Minya 201360
646 Santa Ana SLV Santa Ana 139389
762 Bahir Dar ETH Amhara 96140
123 Chicago 10000
222 New York 200000 """
sep = r'(\d+)\s+|([\D]+)\s+|(\d+)\s+'
df = pd.read_csv(StringIO(s), sep=sep, engine='python')
df
I get a lot of NaNs. How can I get only 3 columns? The column names are: ID, CITY, POPULATION
Upvotes: 1
Views: 1637
Reputation: 2526
Just to provide an alternative solution without the use of regular expressions:
You could also do the parsing of the text file in plain Python. In some cases this may be easier to maintain than a rather complex regular expression.
For this specific format we know that the first and the last number in every row have a special meaning, so I would use split and rsplit to pick them out.
import pandas as pd
from io import StringIO
s = """6 Rotterdam NLD Zuid-Holland 593321
19 Zaanstad NLD Noord-Holland 135621
214 Porto Alegre BRA Rio Grande do Sul 1314032
397 Lauro de Freitas BRA Bahia 109236
547 Dobric BGR Varna 100399
552 Bujumbura BDI Bujumbura 300000
554 Santiago de Chile CHL Santiago 4703954
626 al-Minya EGY al-Minya 201360
646 Santa Ana SLV Santa Ana 139389
762 Bahir Dar ETH Amhara 96140
123 Chicago 10000
222 New York 200000 """
data = []
for line in StringIO(s):
    line = line.strip()
    if not line:
        continue
    # the first token is the id, the last token is the population;
    # everything in between is the city name
    id_value, line = line.split(" ", 1)
    city, population = line.rsplit(" ", 1)
    data.append((id_value, city, population))
df = pd.DataFrame(data, columns=["id", "city", "population"])
df["id"] = pd.to_numeric(df["id"])
df["population"] = pd.to_numeric(df["population"])
print(df)
I did not do any speed measurements. Depending on the file size, speed might not be an issue at all. But even if it is, I would use this script to preprocess the data first (and one time only), so that a plain pd.read_csv with no additional arguments is enough afterwards.
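For example, a minimal sketch of such a one-time preprocessing step (the file names cities.txt and cities.csv are placeholders, not from the question):

import csv
import pandas as pd

# rewrite the irregular whitespace-separated file into a regular CSV once
with open("cities.txt") as src, open("cities.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["id", "city", "population"])
    for line in src:
        line = line.strip()
        if not line:
            continue
        id_value, rest = line.split(" ", 1)
        city, population = rest.rsplit(" ", 1)
        writer.writerow([id_value, city, population])

# from then on, plain read_csv is enough
df = pd.read_csv("cities.csv")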
Upvotes: 1
Reputation: 627607
You used the pattern to match (extract) text, but in the pandas method you are splitting with the pattern.
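A quick way to see the difference, reusing the sample line from the question: re.split with that alternation returns every capture group of each match, and groups that did not participate come back as None, which read_csv then turns into NaN columns.

import re

line = "6 Rotterdam NLD Zuid-Holland 593321 "

# matching extracts the three groups in one go
re.match(r'(\d+)\s+([\D]+)\s(\d+)\s+', line).groups()
# ('6', 'Rotterdam NLD Zuid-Holland', '593321')

# splitting with the same groups used as alternatives does not
re.split(r'(\d+)\s+|([\D]+)\s+|(\d+)\s+', line)
# ['', '6', None, None, '', None, 'Rotterdam NLD Zuid-Holland', None, '', '593321', None, None, '']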
If there can be only 1, 2 or 3 digits at the start of each line, use
sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'
See the regex demo. You may extend it by adding more lookbehinds in the first non-capturing group.
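For instance, a sketch of how that separator could be plugged into read_csv (header=None and the names list are assumptions, since the question's sample has no header row; only a few of the sample rows are repeated here):

import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321
214 Porto Alegre BRA Rio Grande do Sul 1314032
123 Chicago 10000"""

sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'
df = pd.read_csv(StringIO(s), sep=sep, engine='python',
                 header=None, names=['ID', 'CITY', 'POPULATION'])
print(df)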
Details

(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+ - 1+ whitespaces (\s+) that are preceded with 1 digit (\d), or 2 digits (\d{2}), or 3 digits (\d{3}) at the start of string (^)
| - or
\s+(?=\S+\s*$) - 1+ whitespaces followed with 1+ non-whitespace chars and then any trailing 0+ whitespaces before end of string.

Upvotes: 2