Jack
Jack

Reputation: 167

How to convert one data frame to another in R?

I have downloaded txt. file from Kenneth R. French library which can be found by following the link http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_48_ind_port.html.

I need to use these so called SIC codes to split my sample into different portfolios based on industry factor. The downloaded file looks like this:

      1 Food  
      0100-0199 Agric production - crops
      0200-0299 Agric production - livestock
      0700-0799 Agricultural services
      0900-0999 Fishing, hunting & trapping
      2000-2009 Food and kindred products
      2010-2019 Meat products
      2020-2029 Dairy products
      2030-2039 Canned-preserved fruits-vegs
      2040-2046 Flour and other grain mill products
      2047-2047 Dog and cat food
      2048-2048 Prepared feeds for animals
      2050-2059 Bakery products
      2060-2063 Sugar and confectionery products
      2064-2068 Candy and other confectionery
      2070-2079 Fats and oils
      2080-2080 Beverages
      2082-2082 Malt beverages
      2083-2083 Malt
      2084-2084 Wine
      2085-2085 Distilled and blended liquors
      2086-2086 Bottled-canned soft drinks
      2087-2087 Flavoring syrup
      2090-2092 Misc food preps
      2095-2095 Roasted coffee
      2096-2096 Potato chips
      2097-2097 Manufactured ice
      2098-2099 Misc food preparations
      5140-5149 Wholesale - groceries & related prods
      5150-5159 Wholesale - farm products
      5180-5182 Wholesale - beer, wine
      5191-5191 Wholesale - farm supplies

      2 Mines 
      1000-1009 Metal mining
      1010-1019 Iron ores
      1020-1029 Copper ores
      1030-1039 Lead and zinc ores
      1040-1049 Gold & silver ores
      1060-1069 Ferroalloy ores
      1080-1089 Mining services
      1090-1099 Misc metal ores
      1200-1299 Bituminous coal
      1400-1499 Mining and quarrying non-metalic minerals
      5050-5052 Wholesale - metals and minerals

      3 Oil
      1300-1300 Oil and gas extraction
      1310-1319 Crude petroleum & natural gas
      1320-1329 Natural gas liquids
      1380-1380 Oil and gas field services
      1381-1381 Drilling oil & gas wells
      1382-1382 Oil-gas field exploration
      1389-1389 Oil and gas field services
      2900-2912 Petroleum refining
      5170-5172 Wholesale - petroleum and petro prods

      4 Clths 
      2200-2269 Textile mill products
      2270-2279 Floor covering mills
      2280-2284 Yarn and thread mills
      2290-2295 Misc textile goods
      2296-2296 Tire cord and fabric
      2297-2297 Nonwoven fabrics
      2298-2298 Cordage and twine
      2299-2299 Misc textile products
      2300-2390 Apparel and other finished products
      2391-2392 Curtains, home furnishings
      2393-2395 Textile bags, canvas products
      2396-2396 Auto trim
      2397-2399 Misc textile products
      3020-3021 Rubber and plastics footwear
      3100-3111 Leather tanning and finishing
      3130-3131 Boot, shoe cut stock, findings
      3140-3149 Footware except rubber
      3150-3151 Leather gloves and mittens
      3963-3965 Fasteners, buttons, needles, pins
      5130-5139 Wholesale - apparel

The thing I want to do is that to create data frame where the first column gives the industry name (for instance, Food, Mining and Minerals and so on) and the second column all SIC codes ( Standard Industrial Code) related to this industry (as mostly SIC codes are given by the way as 5130-5139 which makes to do it a bit harder).

This data frame will make my analysis easier to implement.

Any suggestion will be highly appreciable.

Upvotes: 0

Views: 112

Answers (2)

Troy
Troy

Reputation: 8701

How about this?

df<-readLines("Siccodes48.txt")
df<-data.frame(t=df[df!=""])              # delete blanks and make data frame
df$prefix<-c(substr(df$t,1,10))           # break out the prefix (first 10 char)
df$index<-cumsum(df$prefix!="          ") # make an index
ind<-df[df$prefix!="          ",]         # make an index table
ind$desc<-substring(ind$t,11,100)         # parse descriptions
final<-merge(ind[,c("index","desc")],     # merge the index table
             df[df$prefix=="          ",c("index","t")],  # with all non-title rows of the list
             by="index")                                  # by index

head(final,10)

   index          desc                                                       t
1      1   Agriculture                      0100-0199 Agric production - crops
2      1   Agriculture                  0200-0299 Agric production - livestock
3      1   Agriculture                         0700-0799 Agricultural services
4      1   Agriculture                            0910-0919 Commercial fishing
5      1   Agriculture                    2048-2048 Prepared feeds for animals
6      2 Food Products                     2000-2009 Food and kindred products
7      2 Food Products                                 2010-2019 Meat products
8      2 Food Products                                2020-2029 Dairy products
9      2 Food Products                  2030-2039 Canned-preserved fruits-vegs
10     2 Food Products           2040-2046 Flour and other grain mill products

You can also add this to break out the codes into a separate column:

final$codes<-substr(gsub("          "," ",final$t),2,10)

Upvotes: 1

lukeA
lukeA

Reputation: 54277

This will produce a 2-column data frame df.new, which contains the comma-seperated codes in column 2:

df <- read.fwf("Siccodes48.txt", widths=c(3, 7, 60), stringsAsFactors=FALSE)
df <- df[!is.na(df$V3), ]
library(zoo)
df$V1 <- na.locf(df$V1)
l <- split(df, df$V1)
l <- setNames(lapply(l, function(x) {
  m <- regexec("([0-9]{4})-([0-9]{4}) .*", x$V3[-1]) # omit headline
  r <- regmatches(x$V3[-1], m)
  fromTo <- t(sapply(r, "[", 2:3))
  paste(sprintf("%04d", unlist(mapply(":", fromTo[, 1], fromTo[, 2]))), collapse=", ")
}), sapply(l, "[", 1, 3))
df.new <- data.frame(name=names(l), sic=unlist(l))

Upvotes: 2

Related Questions