Sebastian Goslin

Reputation: 497

Formatting an unstructured csv in pandas

I'm having an issue reading in accurate information from archived 4chan comments. Since the structure of a 4chan thread doesn't (seem to) translate very well into a rectangular dataframe, I'm having trouble getting the appropriate comments from each thread into a single row in pandas.

To exacerbate the problem, the dataset is 54 GB in size, which makes diagnosing every problem tedious. I asked a similar question earlier about just reading the data into a pandas dataframe, and the solution to that problem is what made me realize this issue.

The code I use to read in portions of the data is as follows:

import pandas as pd

def Four_pleb_chunker():
    """
    :return: 4pleb data is over 54 GB so this chunks it into something manageable
    """
    with open('pol.csv') as f:
        with open('pol_part.csv', 'w') as g:
            for i in range(1000):  # copy the first 1000 physical lines into a smaller working file
                g.write(f.readline())

    name_cols = ['num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired', 'preview_orig', 'preview_w', 'preview_h',
            'media_filename', 'media_w', 'media_h', 'media_size', 'media_hash', 'media_orig', 'spoiler', 'deleted', 'capcode',
            'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked', 'poster_hash', 'poster_country', 'exif']

    cols = ['num','timestamp', 'email', 'name', 'title', 'comment', 'poster_country']

    df_chunk = pd.read_csv('pol_part.csv',
                           names=name_cols,
                           delimiter=None,
                           usecols=cols,
                           skip_blank_lines=True,
                           engine='python',
                           error_bad_lines=False)

    df_chunk = df_chunk.rename(columns={"comment": "Comments"})
    df_chunk = df_chunk.dropna(subset=['Comments'])
    df_chunk['Comments'] = df_chunk['Comments'].str.replace('[^0-9a-zA-Z]+', ' ')

    df_chunk.to_csv('pol_part_df.csv')

    return df_chunk
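For completeness, I just call it directly to generate the sample files and grab the returned frame, e.g.:

sample_df = Four_pleb_chunker()  # writes pol_part.csv / pol_part_df.csv and returns the sample frame
print(sample_df.head())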

This code works fine; however, due to the structure of each thread, a parser that I wrote sometimes returns nonsensical results. In CSV form, this is what the first few rows of the dataset look like (pardon the screenshot, it's extremely difficult to actually write all those lines out using this UI):

[screenshot of the data]

As can be seen, the comments within a thread are split by '\', but each comment doesn't get its own row. My goal is, at the very least, to get each comment into its own row so I can parse it correctly. However, the function I'm using to chunk the data cuts off after 1000 iterations regardless of whether or not that lands on the end of a record.
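For reference, what I'd ideally end up with is something along the lines of pandas' built-in chunksize (a rough, untested sketch below, reusing the name_cols and cols lists from my function), so that each chunk contains complete parsed records rather than a hard cut after 1000 physical lines. That still assumes the delimiter problem is solved, though:

import pandas as pd

# name_cols and cols as defined in Four_pleb_chunker() above
reader = pd.read_csv('pol.csv',
                     names=name_cols,
                     usecols=cols,
                     chunksize=1000,          # 1000 complete records per chunk
                     skip_blank_lines=True,
                     engine='python',
                     error_bad_lines=False)

for df_chunk in reader:
    # each df_chunk is a fully parsed DataFrame, with no record cut in half
    print(df_chunk.shape)
    break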

Fundamentally, my questions are: how can I structure this data so the comments are actually read accurately, and how can I read in a complete sample dataframe as opposed to a truncated one? As for solutions, I've tried:

df_chunk = pd.read_csv('pol_part.csv',
                               names=name_cols,
                               delimiter='',
                               usecols=cols,
                               skip_blank_lines=True,
                               engine='python',
                               error_bad_lines=False)

If I get rid of or change the delimiter argument, I get this error:

Skipping line 31473: ',' expected after '"'

This makes sense, because the data isn't separated by ',', so it skips every line that doesn't fit that condition, which in this case is the whole dataframe. Passing '\' as the argument gives me a syntax error. I'm kind of at a loss for what to do next, so if anyone has experience dealing with an issue like this, you'd be a lifesaver. Let me know if there's something I haven't included here and I'll update the post.
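(I'm assuming the syntax error is just the backslash escaping the closing quote in the string literal; writing it as '\\' at least avoids that, though I'm not convinced '\' is really the field delimiter here:)

# a literal backslash has to be escaped in a normal Python string
df_chunk = pd.read_csv('pol_part.csv',
                       names=name_cols,
                       delimiter='\\',
                       usecols=cols,
                       skip_blank_lines=True,
                       engine='python',
                       error_bad_lines=False)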

Update: here are some sample lines from the CSV for testing:

2   23594708    1385716767  \N  Anonymous   \N  Example: not identifying the fundamental scarcity of resources which underlies the entire global power structure, or the huge, documented suppression of any threats to that via National Security Orders. Or that EVERY left/right ideology would be horrible in comparison to ANY in which energy scarcity and the hierarchical power structures dependent upon it had been addressed.
3   23594754    1385716903  \N  Anonymous   \N  ">>23594701\
                                                 \
                                                  No, /pol/ is bait. That's the point."
4   23594773    1385716983  \N  Anonymous   \N  ">>23594754
                                                 \
                                                 Being a non-bait among baits is equal to being a bait among non-baits."
5   23594795    1385717052  \N  Anonymous   \N  Don't forget how heavily censored this board is! And nobody has any issues with that.
6   23594812    1385717101  \N  Anonymous   \N  ">>23594773\
                                                 \
                                                 Clever. The effect is similar. But there are minds on /pol/ who don't WANT to be bait, at least."

Upvotes: 0

Views: 355

Answers (1)

rje

Reputation: 6428

Here's a sample script that converts your csv into separate lines for each comment:

import csv

# open file for output and create csv writer
# (newline='' keeps the csv module from adding extra blank lines on Windows)
with open('out.csv', 'w', newline='') as f_out:
    w = csv.writer(f_out)

    # open input file and create reader
    with open('test.csv') as f:
        r = csv.reader(f, delimiter='\t')
        for l in r:
            # skip empty lines
            if not l:
                continue
            # in this line I want to split the last part
            # and loop over each resulting string
            for s in l[-1].split('\\\n'):
                # we copy all fields except the last one
                output = l[:-1]
                # add a single comment
                output.append(s)
                w.writerow(output)
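The resulting file is ordinary comma-separated output, so reading it back into pandas should be straightforward (untested sketch; note the script above doesn't write a header row):

import pandas as pd

# header=None because out.csv has no header line
df = pd.read_csv('out.csv', header=None)
print(df.shape)
print(df.head())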

Upvotes: 1

Related Questions