Pake

Reputation: 1118

How can I summarize character columns in my dataframe in R?

With numeric and factor columns, summary() provides information that is useful for understanding the data, as its output on the iris dataset shows.

Here, we see min, 1st quartile, median, mean, 3rd quartile, and max for the numeric columns, which is helpful for a quick spot-check. We also see counts on the factor column.

The following code creates a data frame in which every column is character. Calling summary() on it gives a result that isn't very helpful as a summary of the values in my data (at least for the purposes I'm interested in).

library(dplyr)

iris2 <- iris %>%
  mutate_all(as.character)

summary(iris2)


In general, when I use summary() on character columns, I'd like something more like the results I get for factor columns.

I realize that I can convert my character columns to factor and then run summary() with something like the below:

iris3 <- iris2 %>%
  mutate_all(as.factor)

summary(iris3)


Is there a way to avoid this extra step when spot-checking my data? I ultimately want to keep working with the data as character columns rather than factors, and would prefer not to switch back and forth between the data types. It wouldn't matter to me if the conversion happens "behind the scenes". For what it's worth, an expanded summary() for numeric columns that included some of the high-frequency values would be interesting as well. Thank you in advance for any help.

Upvotes: 2

Views: 5228

Answers (1)

akrun

Reputation: 887911

If the goal is an overall summary of the dataset, skim() from the skimr package may be useful:

skimr::skim(iris)

-output

── Data Summary ────────────────────────
                           Values
Name                       iris  
Number of rows             150   
Number of columns          5     
_______________________          
Column type frequency:           
  factor                   1     
  numeric                  4     
________________________         
Group variables            None  

── Variable type: factor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts               
1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50

── Variable type: numeric ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
1 Sepal.Length          0             1  5.84 0.828   4.3   5.1  5.8    6.4   7.9 ▆▇▇▅▂
2 Sepal.Width           0             1  3.06 0.436   2     2.8  3      3.3   4.4 ▁▆▇▂▁
3 Petal.Length          0             1  3.76 1.77    1     1.6  4.35   5.1   6.9 ▇▁▆▇▂
4 Petal.Width           0             1  1.20 0.762   0.1   0.3  1.3    1.8   2.5 ▇▁▇▅▃
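
For the character-column case in the question, a minimal base R sketch may also help. It assumes the aim is only a spot-check, so the factor conversion happens on a local copy inside the function and the original data frame keeps its character columns. The names summary_chr and top_values are hypothetical helpers made up for this illustration; they are not part of base R, dplyr, or skimr.

```r
# Hypothetical helper: summarise a data frame, treating character columns
# as factors only for the duration of the call.
summary_chr <- function(df, ...) {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    # Only the function's local copy is modified; the caller's data is untouched.
    df[is_chr] <- lapply(df[is_chr], factor)
  }
  summary(df, ...)
}

# Hypothetical helper: the n most frequent values of any vector
# (works for numeric columns too).
top_values <- function(x, n = 3) {
  head(sort(table(x), decreasing = TRUE), n)
}

# All-character version of iris, built without dplyr.
iris2 <- data.frame(lapply(iris, as.character), stringsAsFactors = FALSE)

summary_chr(iris2["Species"])   # counts per level, as for a factor column
top_values(iris$Petal.Width)    # most frequent values of a numeric column
class(iris2$Species)            # still "character"
```

skimr::skim(iris2) should also accept character columns directly, but as far as I know its character summary reports string lengths and the number of unique values rather than per-value counts, so the helper above is closer to the factor-style output asked for.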

Upvotes: 1
