user1690130
user1690130

Reputation: 500

Stata - inputting data from .txt with "" and ,

I am using perl to scrape the following through .txt which I'd ultimately bring into Stata. What format option works? I have many such observations, so would like to use an approach over which I can generalize.

The original data are of the form:

 First Name: Allen
 Last Name: Von Schmidt
 Birth Year: 1965
 Location: District 1, Ocean City, Cape May, New Jersey, USA

 First Name: Lee Roy
 Last Name: McBride
 Birth Year: 1967
 Location: Precinct 5, District 2, Chicago, Cook, Illinois, USA

The goal is to create the variables in Stata:

  First Name: Allen
  Last Name: Von Schmidt
  Birth Year: 1965
  County: Cape May
  State: New Jersey

  First Name: Allen
  Last Name: McBride
  Birth Year: 1967
  County: Cook
  State: Illinois

What possible .txt might lead to such, and how would I load it into Stata?

Also, the amount of terms vary in Location as in these 2 examples, but I always want the 2 before USA.

At the moment, I am putting "", around each variable from the table for the .txt.

 "Allen","Von Schmidt","1965","District 1, Ocean City, Cape May, New Jersey, USA"
 "Lee Roy","McBride","1967","Precinct 5, District 2, Chicago, Cook, Illinois, USA"

Is there a better way to format the .txt? How would I create the corresponding variables in Stata?

Thank you for your help!

P.S. I know that stata uses infile or insheet and can handle , or tabs to separate variables. I did not know how to scrape a variable like Location in perl with all of the those so I added the ""

Upvotes: 1

Views: 174

Answers (2)

Nick Cox
Nick Cox

Reputation: 37208

This isn't a complete answer, but I need more space and flexibility than comments (easily) allow.

One trick is based on peeling off elements from the end. The easiest way to do that could be to start looking for the last comma, which is in turn the first comma in the reversed string. Use strpos(reverse(stringvar), ",").

For example the first commma is found by strpos() like this

. di strpos("abcd,efg,h", ",")
5

and the last comma like this

. di strpos(reverse("abcd,efg,h"), ",")
2

Once you know where the last comma is you can peel off the last element. If the last comma is at position # in the reversed string, it is at position -# in the string.

. di substr("abcd,efg,h", -2, 2)
,h

These examples clearly are calculator-style examples for single strings. But the last element can be stripped off similarly for entire string variables.

. gen poslastcomma = strpos(reverse(var), ",") 
. gen var_end = substr(var, -poslastcomma, poslastcomma) 
. gen var_begin = substr(var, 1, length(var) - poslastcomma) 

Once you get used to stuff like this you can write more complicated statements with fewer variables, but slowly, slowly step by step is better when you are learning.

By the way, a common Stata learner error (in my view) is to assume that a solution to a string problem must entail the use of regular expressions. If you are very fluent at regular expressions, you can naturally do wonderful things with them, but the other string functions in conjunction can be very powerful too.

In your specific example, it sounds as if you want to ignore a last element such as "USA" and then work in turn on the next elements working backwards.

split in Stata is fine too (I am a fan and indeed am its putative author) but can be awkward if a split yields different numbers of elements, which is where I came in.

Upvotes: 0

dimitriy
dimitriy

Reputation: 9460

There are two ways to do this. The first is to paste the data into your do-file and use input. Assuming the format is fairly regular, you can clean it up easily using commas to parse. Note that I removed the commas:

#delimit;
input
str100(first_name last_name yob geo);
 "Allen" "Von Schmidt" "1965" "District 1, Ocean City, Cape May, New Jersey, USA";
end;

compress;
destring, replace;

split geo, parse(,);

rename geo1 district;
rename geo2 city;
rename geo3 county;
rename geo4 state;
rename geo5 country;
drop geo;

The second way is to insheet the data from the txt file directly, which is probably easier. This assumes that the commas were not removed:

 #delimit;
 insheet first_name last_name yob geo using "raw_data.txt", clear comma nonames;

Then clean it up as in the first example.

Upvotes: 3

Related Questions