L. Meister

Reputation: 1

Trouble reading a readme.md with pandas

EDIT: Forgot to mention that this has to be done in pandas

I've got a little problem reading a certain file into a pandas dataframe. I've tried:

import pandas as pd
import matplotlib.pyplot as plt

dataframe = pd.read_csv('/home/leon/Desktop/Uni/ML Lab/Text.txt', 
delim_whitespace=True, header=None)
print(dataframe)

If I try it with a .txt containing something like "Hello this is a test" it works fine, but with the actual readme.md I get this error:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-47-231496e21612> in <module>()
      2 import matplotlib.pyplot as plt
      3 
----> 4 dataframe = pd.read_csv('/home/leon/Desktop/Uni/ML Lab/Text.txt', delim_whitespace=True, header=None)
      5 print(dataframe)

~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676                     skip_blank_lines=skip_blank_lines)
    677 
--> 678         return _read(filepath_or_buffer, kwds)
    679 
    680     parser_f.__name__ = name

~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    444 
    445     try:
--> 446         data = parser.read(nrows)
    447     finally:
    448         parser.close()

~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1034                 raise ValueError('skipfooter not supported for iteration')
   1035 
-> 1036         ret = self._engine.read(nrows)
   1037 
   1038         # May alter columns / col_dict

~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1846     def read(self, nrows=None):
   1847         try:
-> 1848             data = self._reader.read(nrows)
   1849         except StopIteration:
   1850             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 10

I'm reading it into a dataframe so that I can count the number of unique words and the occurrences of words in general. I'm sorry for this beginner question, but I've just started out with Python! Greetings.

Upvotes: 0

Views: 1545

Answers (2)

Karn Kumar

Reputation: 8826

See if this helps:

>>> import pandas as pd
>>> dataframe  = pd.read_table('README.md.1', skip_blank_lines=True)
>>> dataframe = dataframe.rename(columns={'# Tensorflow Object Detection API': 'Tensorflow'})
>>> dataframe.head()
                                          Tensorflow
0  Creating accurate machine learning models capa...
1  multiple objects in a single image remains a c...
2  The TensorFlow Object Detection API is an open...
3  TensorFlow that makes it easy to construct, tr...
4  models.  At Google we’ve certainly found this ...
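Since the question says this has to stay in pandas, the counting itself can also be done on that column. Here is a minimal sketch, assuming pandas >= 0.25 (for `Series.explode`) and using a short stand-in text instead of the actual readme:

```python
import pandas as pd

# Hypothetical sample text standing in for the readme's contents.
text = """# Tensorflow Object Detection API
Creating accurate machine learning models
remains a core challenge"""

# One row per non-empty line, kept as raw strings (no field tokenizing,
# so the "Expected N fields" parser error cannot occur).
df = pd.DataFrame({"line": [ln for ln in text.splitlines() if ln.strip()]})

# Split each line into words, flatten the lists with explode,
# and count how often each word occurs.
word_counts = df["line"].str.split().explode().value_counts()

print(word_counts["remains"])  # -> 1
```

`word_counts` is a Series indexed by word, so `len(word_counts)` gives the number of unique words.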

Upvotes: 0

Sam Comber

Reputation: 1293

A pandas dataframe is unsuitable for this task. Just read the file, split each line into words, flatten the resulting list of lists, and then aggregate the counts using Counter from collections.

from collections import Counter

# Split every line of the file into a list of words
with open("README.md") as f:
    file_split = [line.split() for line in f]

# Flatten the list of lists into a single list of words
file_split_flatten = [val for sublist in file_split for val in sublist]

# Counter maps each word to its number of occurrences
count_dict = dict(Counter(file_split_flatten))

Then to access the count just do:

print(count_dict['Tensorflow'])
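As a side note, `Counter` is itself dict-like, so converting it is optional, and it also offers `most_common` for ranking words. A small sketch on a stand-in word list rather than the actual readme:

```python
from collections import Counter

# Stand-in words; the real code would pass file_split_flatten.
words = "the api and the models and the docs".split()
counts = Counter(words)

print(counts["the"])          # -> 3
print(counts.most_common(2))  # -> [('the', 3), ('and', 2)]
```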

Upvotes: 1
