Reputation: 167
I have downloaded txt. file from Kenneth R. French library which can be found by following the link http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_48_ind_port.html.
I need to use these so called SIC codes to split my sample into different portfolios based on industry factor. The downloaded file looks like this:
1 Food
0100-0199 Agric production - crops
0200-0299 Agric production - livestock
0700-0799 Agricultural services
0900-0999 Fishing, hunting & trapping
2000-2009 Food and kindred products
2010-2019 Meat products
2020-2029 Dairy products
2030-2039 Canned-preserved fruits-vegs
2040-2046 Flour and other grain mill products
2047-2047 Dog and cat food
2048-2048 Prepared feeds for animals
2050-2059 Bakery products
2060-2063 Sugar and confectionery products
2064-2068 Candy and other confectionery
2070-2079 Fats and oils
2080-2080 Beverages
2082-2082 Malt beverages
2083-2083 Malt
2084-2084 Wine
2085-2085 Distilled and blended liquors
2086-2086 Bottled-canned soft drinks
2087-2087 Flavoring syrup
2090-2092 Misc food preps
2095-2095 Roasted coffee
2096-2096 Potato chips
2097-2097 Manufactured ice
2098-2099 Misc food preparations
5140-5149 Wholesale - groceries & related prods
5150-5159 Wholesale - farm products
5180-5182 Wholesale - beer, wine
5191-5191 Wholesale - farm supplies
2 Mines
1000-1009 Metal mining
1010-1019 Iron ores
1020-1029 Copper ores
1030-1039 Lead and zinc ores
1040-1049 Gold & silver ores
1060-1069 Ferroalloy ores
1080-1089 Mining services
1090-1099 Misc metal ores
1200-1299 Bituminous coal
1400-1499 Mining and quarrying non-metalic minerals
5050-5052 Wholesale - metals and minerals
3 Oil
1300-1300 Oil and gas extraction
1310-1319 Crude petroleum & natural gas
1320-1329 Natural gas liquids
1380-1380 Oil and gas field services
1381-1381 Drilling oil & gas wells
1382-1382 Oil-gas field exploration
1389-1389 Oil and gas field services
2900-2912 Petroleum refining
5170-5172 Wholesale - petroleum and petro prods
4 Clths
2200-2269 Textile mill products
2270-2279 Floor covering mills
2280-2284 Yarn and thread mills
2290-2295 Misc textile goods
2296-2296 Tire cord and fabric
2297-2297 Nonwoven fabrics
2298-2298 Cordage and twine
2299-2299 Misc textile products
2300-2390 Apparel and other finished products
2391-2392 Curtains, home furnishings
2393-2395 Textile bags, canvas products
2396-2396 Auto trim
2397-2399 Misc textile products
3020-3021 Rubber and plastics footwear
3100-3111 Leather tanning and finishing
3130-3131 Boot, shoe cut stock, findings
3140-3149 Footware except rubber
3150-3151 Leather gloves and mittens
3963-3965 Fasteners, buttons, needles, pins
5130-5139 Wholesale - apparel
The thing I want to do is that to create data frame where the first column gives the industry name (for instance, Food, Mining and Minerals and so on) and the second column all SIC codes ( Standard Industrial Code) related to this industry (as mostly SIC codes are given by the way as 5130-5139 which makes to do it a bit harder).
This data frame will make my analysis easier to implement.
Any suggestion will be highly appreciable.
Upvotes: 0
Views: 112
Reputation: 8701
How about this?
df<-readLines("Siccodes48.txt")
df<-data.frame(t=df[df!=""]) # delete blanks and make data frame
df$prefix<-c(substr(df$t,1,10)) # break out the prefix (first 10 char)
df$index<-cumsum(df$prefix!=" ") # make an index
ind<-df[df$prefix!=" ",] # make an index table
ind$desc<-substring(ind$t,11,100) # parse descriptions
final<-merge(ind[,c("index","desc")], # merge the index table
df[df$prefix==" ",c("index","t")], # with all non-title rows of the list
by="index") # by index
head(final,10)
index desc t
1 1 Agriculture 0100-0199 Agric production - crops
2 1 Agriculture 0200-0299 Agric production - livestock
3 1 Agriculture 0700-0799 Agricultural services
4 1 Agriculture 0910-0919 Commercial fishing
5 1 Agriculture 2048-2048 Prepared feeds for animals
6 2 Food Products 2000-2009 Food and kindred products
7 2 Food Products 2010-2019 Meat products
8 2 Food Products 2020-2029 Dairy products
9 2 Food Products 2030-2039 Canned-preserved fruits-vegs
10 2 Food Products 2040-2046 Flour and other grain mill products
You can also add this to break out the codes into a separate column:
final$codes<-substr(gsub(" "," ",final$t),2,10)
Upvotes: 1
Reputation: 54277
This will produce a 2-column data frame df.new
, which contains the comma-seperated codes in column 2:
df <- read.fwf("Siccodes48.txt", widths=c(3, 7, 60), stringsAsFactors=FALSE)
df <- df[!is.na(df$V3), ]
library(zoo)
df$V1 <- na.locf(df$V1)
l <- split(df, df$V1)
l <- setNames(lapply(l, function(x) {
m <- regexec("([0-9]{4})-([0-9]{4}) .*", x$V3[-1]) # omit headline
r <- regmatches(x$V3[-1], m)
fromTo <- t(sapply(r, "[", 2:3))
paste(sprintf("%04d", unlist(mapply(":", fromTo[, 1], fromTo[, 2]))), collapse=", ")
}), sapply(l, "[", 1, 3))
df.new <- data.frame(name=names(l), sic=unlist(l))
Upvotes: 2