I've come across a bug in Stata's insheet command that is quite worrisome.
Many power users will have their original datasets in raw text format. This is often the case for data coming from a variety of sources (public datasets, data downloaded or scraped from the Internet, etc.). Not only is it a good habit to store data as raw text, but it also reduces lock-in with a specific platform or software. You are more likely to be able to open that file again in 5-10 years if it's in a pure text format.
A common format for raw text files is the tab-delimited one, as tabs rarely appear in data. Each column is separated by a tab symbol (\t) and when you import the data you use the tab to recognize when one variable ends and the next one begins.
Sometimes strings are enclosed within double quotes " " , but this can create issues if any of the quotes is missing in the original data. Personally, I prefer to avoid double quotes and rely on tabs to isolate one variable from another.
The Stata insheet command has a tab option that is supposed to do just that, i.e. take a raw text file and import it into memory using tabs as delimiters.
Now the problem is that insheet still relies on quotes if it finds them, even if most of the strings in a file are not enclosed in double quotes. This is a serious bug and can lead to some disastrous consequences.
Here is an example:

| v1 | v2 | v3 |
| 1 | Testing 123 | This is a "string" of text |
| 2 | Testing 456 | This is another "string of text |
| 3 | Testing 789 | One more "string" of text |
If you're working on data parsed from a website, data that contains user-generated content, or data that has been entered by a human being, a missing double quote is very likely to occur, as in the v3 column above (row 1 is correct, but in rows 2 and 3 the quote symbols are not paired). The raw file can be downloaded from here.
This will confuse insheet, which will then import the data incorrectly. Moreover, you won't receive any warning from Stata and it will look as if the file was imported correctly!
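Since Stata stays silent, one way to spot suspicious rows before importing is to look for lines with an odd number of double quotes. Below is a minimal Python sketch of that check. The function name rows_with_unbalanced_quotes is made up for this example, and the sample rows are typed from the table as printed above, so they may differ from the downloadable file.

```python
def rows_with_unbalanced_quotes(lines):
    """Return the 1-based numbers of lines that hold an odd number of double quotes."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count('"') % 2 == 1
    ]


# Sample rows typed from the table above (an assumption, not the real file).
sample = [
    '1\tTesting 123\tThis is a "string" of text',
    '2\tTesting 456\tThis is another "string of text',
    '3\tTesting 789\tOne more "string" of text',
]

# Any line number reported here deserves a closer look before running insheet.
print(rows_with_unbalanced_quotes(sample))
```

An odd count is only a heuristic: a row with two stray quotes would pass this check, so it complements the row-count comparison rather than replacing it.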
If you type:
insheet using test.txt, tab clear
Stata will return no errors and output:
(3 vars, 2 obs)
Your dataset will look like this:
| v1 | v2 | v3 |
| 1 | Testing 123 | string of text |
| 2 | Testing 456 | of text |
As you can see, we lost 1 row of data and v3 contains the wrong information. This is because insheet saw the first quote and looked ahead until it could find another quote to close the string.
Now you may think that this is a minor issue that is easy to spot by looking at the dataset, but imagine a dataset with 500,000 rows and 50 variables. Unless you know your exact row count, you are very likely to miss a lot of data. What makes the bug even more creepy is that you may lose data not just at the bottom of the file but at any intermediate row, which makes it harder to detect.
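To make the look-ahead concrete, here is a toy Python model of a tab parser that either honours or ignores double quotes. It only illustrates the behaviour described above and is not insheet's actual implementation. The names split_records and honor_quotes are invented for this sketch, and the sample rows are typed from the table as printed earlier.

```python
def split_records(text, honor_quotes):
    """Split text into records (lists of fields) on tabs and newlines.

    When honor_quotes is True, a double quote toggles a "quoted" state in
    which tabs and newlines are treated as ordinary characters.
    """
    records, fields, current = [], [], []
    in_quotes = False
    for ch in text:
        if honor_quotes and ch == '"':
            in_quotes = not in_quotes          # quote opens or closes a string
        elif ch == "\t" and not in_quotes:
            fields.append("".join(current))    # end of a field
            current = []
        elif ch == "\n" and not in_quotes:
            fields.append("".join(current))    # end of a record
            current = []
            records.append(fields)
            fields = []
        else:
            current.append(ch)                 # includes delimiters inside quotes
    if current or fields:                      # flush an unterminated last record
        fields.append("".join(current))
        records.append(fields)
    return records


# Sample rows typed from the table above (an assumption, not the real file).
RAW = (
    '1\tTesting 123\tThis is a "string" of text\n'
    '2\tTesting 456\tThis is another "string of text\n'
    '3\tTesting 789\tOne more "string" of text\n'
)

quoted = split_records(RAW, honor_quotes=True)
plain = split_records(RAW, honor_quotes=False)
print(len(quoted), len(plain))  # prints: 2 3
```

With honor_quotes=True the newline after row 2 is swallowed by the open string, so three lines collapse into two records, the same observation count Stata reported. The field contents of this toy model are not meant to reproduce Stata's output exactly.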
Here's a workaround:
1) MAC or LINUX USERS
Before using insheet, always type
shell wc -l nameofyourfile.txt
In the case above, this would return:
3 test.txt
Here 3 is the number of rows in your raw text file. If you notice that insheet imports fewer rows than that, you can search for and replace any quotes (") in the raw file with your favourite text editor before importing.
2) WINDOWS USERS
Before using insheet, always type
shell find /v /c "&randomtext&*" nameofyourfile.txt & pause
This will open a command shell and count the number of rows in your file that do not contain "&randomtext&", i.e. usually every row (feel free to make the string more complex if you want!). On large files, this may take a few seconds, so be patient.
As in the Mac case, the number you see next to the filename is the real number of rows.
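If you prefer a single check that behaves the same on Mac, Linux and Windows, the two steps of the workaround (count the rows, then strip the quotes) can be sketched in Python. This is only an illustration under my own assumptions: the function names count_lines and strip_double_quotes and the file name test_noquotes.txt are made up, and the demo writes the sample rows to a temporary folder rather than touching your real data.

```python
import tempfile
from pathlib import Path


def count_lines(path):
    """Count newline characters in a file, the same quantity `wc -l` reports."""
    total = 0
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total


def strip_double_quotes(src, dst):
    """Write a copy of src to dst with every double quote removed."""
    Path(dst).write_bytes(Path(src).read_bytes().replace(b'"', b""))


# Demo on a throwaway copy of the sample rows (typed from the table above).
workdir = Path(tempfile.mkdtemp())
raw = workdir / "test.txt"
clean = workdir / "test_noquotes.txt"
raw.write_bytes(
    b'1\tTesting 123\tThis is a "string" of text\n'
    b'2\tTesting 456\tThis is another "string of text\n'
    b'3\tTesting 789\tOne more "string" of text\n'
)

strip_double_quotes(raw, clean)
print(count_lines(raw), count_lines(clean))
```

Compare the first number with the observation count that insheet reports. If they differ, import the quote-free copy instead. Note that, like wc -l, this counts newline characters, so a file whose last row has no trailing newline will come out one short.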
Enjoy!
Christian







PhD Candidate in Strategy at the Rotman School of Management and technology enthusiast, I wrote my undergraduate degree thesis on the economics of open source development and my MSc final dissertation on "The link between science and technology: exploring the network of inventors and scientific authors in the semiconductor industry". After working at KITES-CESPRI Bocconi on the European research project “Highly cited patent”, I've started my PhD in Strategic Management at Rotman. Current projects include "Markets Making Music", with Ajay Agrawal; "Intellectual Property and the Diffusion of Formal Standards", with Timothy Simcoe; "Authors-inventors: life on the boundary between science and technology", with Stefano Breschi.