Reputation: 1
EDIT: I forgot to mention that this has to be done in pandas.
I've got a little problem reading a certain file into a pandas dataframe. I've tried:
import pandas as pd
import matplotlib.pyplot as plt
dataframe = pd.read_csv('/home/leon/Desktop/Uni/ML Lab/Text.txt',
                        delim_whitespace=True, header=None)
print(dataframe)
With a .txt file containing something like "Hello this is a test" it works fine, but with the actual readme.md I get this error:
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-47-231496e21612> in <module>()
2 import matplotlib.pyplot as plt
3
----> 4 dataframe = pd.read_csv('/home/leon/Desktop/Uni/ML Lab/Text.txt', delim_whitespace=True, header=None)
5 print(dataframe)
~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
676 skip_blank_lines=skip_blank_lines)
677
--> 678 return _read(filepath_or_buffer, kwds)
679
680 parser_f.__name__ = name
~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
444
445 try:
--> 446 data = parser.read(nrows)
447 finally:
448 parser.close()
~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
1034 raise ValueError('skipfooter not supported for iteration')
1035
-> 1036 ret = self._engine.read(nrows)
1037
1038 # May alter columns / col_dict
~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
1846 def read(self, nrows=None):
1847 try:
-> 1848 data = self._reader.read(nrows)
1849 except StopIteration:
1850 if self._first_chunk:
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 10
I'm reading it into a dataframe so that I can count the number of unique words and the occurrences of words in general. I'm sorry for this beginner question, but I've just started out with Python! Greetings.
Upvotes: 0
Views: 1545
Reputation: 8826
See if this helps:
>>> import pandas as pd
>>> dataframe = pd.read_table('README.md.1', skip_blank_lines=True)
>>> dataframe = dataframe.rename(columns={'# Tensorflow Object Detection API':'Tensorflow'})
>>> dataframe.head()
Tensorflow
0 Creating accurate machine learning models capa...
1 multiple objects in a single image remains a c...
2 The TensorFlow Object Detection API is an open...
3 TensorFlow that makes it easy to construct, tr...
4 models. At Google we’ve certainly found this ...
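The original call most likely failed because delim_whitespace=True turns every word into its own field, so lines with different word counts no longer fit the column count inferred from the first line. Keeping each line in a single column avoids that, and the words can then be counted in pandas. Below is a minimal sketch of that idea, not a definitive solution: the sample text and the column name "line" are made up for illustration, and it assumes the file contains no tab characters and a pandas version that has Series.explode (0.25 or later).

```python
import csv
import io

import pandas as pd

# Hypothetical sample text standing in for the README contents
sample = io.StringIO("the cat sat\non the mat\nthe end\n")

# read_table splits on tabs by default, so each tab-free line stays in one column.
# QUOTE_NONE stops quote characters in prose from being treated as field quoting.
lines = pd.read_table(sample, header=None, names=["line"], quoting=csv.QUOTE_NONE)

# Split each line on whitespace, then give every word its own row
words = lines["line"].str.split().explode()

# Occurrences per word, most frequent first
counts = words.value_counts()

n_unique = counts.size        # number of distinct words
n_total = int(counts.sum())   # number of words overall
print(counts)
```

Splitting on whitespace keeps punctuation attached and treats "Tensorflow" and "TensorFlow" as different words, so some normalisation (for example words.str.lower()) may be wanted before counting.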
Upvotes: 0
Reputation: 1293
A pandas dataframe is unsuitable for this task. Instead, read the file, split each line into words, and flatten the resulting list of lists into a single list of words. You can then aggregate the counts using Counter from collections.
from collections import Counter
with open("README.md") as f:
    # One list of whitespace-separated words per line
    file_split = [line.split() for line in f]
# Flatten the list of lists into a single list of words
file_split_flatten = [val for sublist in file_split for val in sublist]
# Counter maps each word to its number of occurrences; build it only once
count_dict = dict(Counter(file_split_flatten))
Then to access the count just do:
print(count_dict['Tensorflow'])
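If the counts are kept as a Counter instead of being converted to a plain dict, the other numbers the question asks for come almost for free. A small sketch under that assumption, using a made-up sample string in place of the README contents:

```python
from collections import Counter

# Hypothetical sample text standing in for the README contents
text = "the cat sat on the mat the end"

counts = Counter(text.split())

n_unique = len(counts)            # number of distinct words
n_total = sum(counts.values())    # number of words overall
top_word = counts.most_common(1)  # list with the single most frequent (word, count) pair
missing = counts["dog"]           # a Counter returns 0 for unseen words, a dict would raise KeyError

print(n_unique, n_total, top_word, missing)
```

If the result does have to end up in pandas, a Counter is a dict subclass, so it can presumably be handed to pd.Series directly.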
Upvotes: 1