Etienne Caronan

Reputation: 1

Using pandas to read a data file with no structure (no header row and rows of different lengths)

I am reading data from a .dat file.

Here's an example of what the dataset looks like:

38 39 41 109 110 
39 111 112 113 114 115 116 117 118 
119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 
48 134 135 136 
39 48 137 138 139 140 141 142 143 144 145 146 147 148 149 

What I'm trying to do is read the data file and get a random row from it, such as:

119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 

I've been doing this:

    data_url = "someurl.dat"

    market_basket = pd.read_csv(data_url, header=None, delimiter='\n+', engine="python")
    sample = market_basket.sample(n=1)

But when I output the value of sample, this is what I get:

                                  0
40911  39 2787 2858 5016 5041 13569

Moreover, when I search for the output row, I can't find it in my dataset. Why?

Upvotes: 0

Views: 282

Answers (2)

Serge Ballesta

Reputation: 149075

This is a pandas variation on Rafaël's answer.

Pandas read_csv can read one single line from a file, thanks to the skiprows and nrows parameters. The hard part is in fact how to find a random line number...

So a simple way is to read all lines from the input file, choose a random one and feed that single line into the dataframe:

import pandas as pd
import random
import io

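with open("someurl.dat") as fd:
    # readlines() must be called; it returns a list with one string per line
    line = random.choice(fd.readlines())

# parse the chosen line as a one-row, whitespace-separated table
df = pd.read_csv(io.StringIO(line), sep=r'\s+', header=None)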
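
If the file is too large to read in full, a different technique (reservoir sampling, which is not what the code above does) can pick a line uniformly at random in a single pass, without knowing the line count in advance. This is only a minimal sketch: the helper name `pick_random_line` and the in-memory sample standing in for the real file are assumptions for illustration.

```python
import io
import random


def pick_random_line(lines, rng=random):
    """Return one line chosen uniformly at random from an iterable of lines.

    Reservoir sampling with a reservoir of size one: the n-th line replaces
    the current pick with probability 1/n. Returns None for empty input.
    """
    chosen = None
    for count, line in enumerate(lines, start=1):
        if rng.randrange(count) == 0:  # true with probability 1/count
            chosen = line
    return chosen


# Stand-in for open("someurl.dat"); rows copied from the question.
sample_file = io.StringIO(
    "38 39 41 109 110 \n"
    "39 111 112 113 114 115 116 117 118 \n"
    "48 134 135 136 \n"
)
line = pick_random_line(sample_file)
print(line.split())
```

The chosen string can then be wrapped in io.StringIO and handed to pd.read_csv in the same way as the line picked with random.choice.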

BTW, your code cannot give you the expected dataframe. With

market_basket = pd.read_csv(data_url, header=None, delimiter='\n+', engine="python")
sample = market_basket.sample(n=1)

market_basket is a DataFrame with a single column containing the full lines, indexed by their zero-based position among the rows read. So sample is the row labelled 40911, containing 39 2787 2858 5016 5041 13569. To parse it, you still need to first extract the actual field (.iloc[0, 0]) and split it:

sample = pd.read_csv(io.StringIO(sample.iloc[0, 0]), sep=r'\s+', header=None)

Upvotes: 1

Rafaël Dera

Reputation: 409

Why pandas? Can't you simply open the file with plain Python?

Something like:

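import random

filename = "someurl.dat"  # path to the data file

with open(filename) as a:
    data = a.read().splitlines()  # one string per row, newlines removed
line = random.choice(data)  # one random row, still a single string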
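
The chosen row is still one string, so it probably needs splitting before use. Here is a minimal sketch that returns the items as a list of integers. The function name `random_basket` and the temporary file standing in for the real .dat file are assumptions for illustration, and it assumes every item in the file is an integer, as in the sample shown in the question.

```python
import random
import tempfile
from pathlib import Path

# Rows copied from the question; the temporary file below is a stand-in
# for the real data file.
SAMPLE = (
    "38 39 41 109 110 \n"
    "39 111 112 113 114 115 116 117 118 \n"
    "48 134 135 136 \n"
)


def random_basket(path, rng=random):
    """Return one randomly chosen non-empty line of `path` as a list of ints."""
    with open(path) as fh:
        # str.split() with no argument ignores the trailing space on each row
        rows = [line.split() for line in fh if line.strip()]
    return [int(item) for item in rng.choice(rows)]


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "baskets.dat"
    demo_path.write_text(SAMPLE)
    basket = random_basket(demo_path)
    print(basket)
```

Because rows have different lengths, a plain list per row avoids the padding that a rectangular DataFrame would require.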

Upvotes: 1
