catalini.com - Christian Catalini


Beware of STATA's insheet command


I've come across a bug in STATA's insheet command that is quite worrisome.

Many power users will have their original datasets in raw text format. This is often the case for data coming from a variety of sources (public datasets, data downloaded or scraped from the Internet, etc.). Not only is it a good habit to store data as raw text, but it also reduces lock-in with a specific platform or software. You are more likely to be able to open that file again in 5-10 years if it's in a pure text format.

A common format for raw text files is the tab-delimited one, as tabs rarely appear in data. Each column is separated by a tab symbol (\t) and when you import the data you use the tab to recognize when one variable ends and the next one begins.

Sometimes strings are enclosed within double quotes " " , but this can create issues if any of the quotes is missing in the original data. Personally, I prefer to avoid double quotes and rely on tabs to isolate one variable from another.

The STATA insheet command has a tab option that is supposed to do just that, i.e. take a raw text file and import it into memory using tabs as delimiters.

Now the problem is that insheet still relies on quotes if it finds them, even if most of the strings in a file are not enclosed in double quotes. This is a serious bug and can lead to some disastrous consequences.

Here is an example:

v1   v2            v3
1    Testing 123   This is a "string" of text
2    Testing 456   This is another "string of text
3    Testing 789   One more "string" of text

 

If you're working on data parsed from a website, data that contains user-generated content, or data that has been entered by a human being, a missing double quote is very likely to occur, as in the last column above (row 1 is correct, but in both rows 2 and 3 the quote symbol is not double). The raw file can be downloaded from here.

This will confuse insheet, which will import the data incorrectly. Moreover, you won't receive any warning from STATA and it will look as if the file was imported correctly!

If you type:

   insheet using test.txt, tab clear

Stata will return no errors and output:

   (3 vars, 2 obs)

Your dataset will look like this:

v1 v2 v3 
Testing 123  string of text 
2Testing 456     of text 

 

As you can see, we lost one row of data and v3 contains the wrong information. This is because insheet saw the first quote and looked ahead until it could find another quote to close the string.

You may think that this is a minor issue that is easy to spot by looking at the dataset, but imagine a dataset with 500,000 rows and 50 variables. Unless you know your exact row count, you are very likely to miss a lot of data. What makes the bug even more creepy is that you may lose data not just at the bottom of the file, but at any intermediate row, which makes it harder to detect.
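Because rows go missing wherever a quote is left unpaired, one way to locate the trouble spots is to flag every line that contains an odd number of double quotes. The following is only a minimal Python sketch of that idea, run outside Stata; the function name is made up for illustration, and the sample rows are the three example rows as printed above, with tabs between the columns.

```python
def lines_with_unpaired_quotes(lines):
    """Return the 1-based numbers of the lines with an odd count of '"'."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        if line.count('"') % 2 == 1:
            flagged.append(number)
    return flagged


# The three example rows as printed above (hypothetical in-memory copy).
sample = [
    '1\tTesting 123\tThis is a "string" of text',
    '2\tTesting 456\tThis is another "string of text',
    '3\tTesting 789\tOne more "string" of text',
]

print(lines_with_unpaired_quotes(sample))
```

On a real file you would pass the open file object instead of the list. An odd count is only a hint: a line with two stray quotes would not be flagged, so the row count check described below is still worth doing.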

 

Here's a workaround:

1) MAC or LINUX USERS

Before using insheet, always type

shell wc -l nameofyourfile.txt

In the case above, this would return:

  3 test.txt

Here 3 is the number of rows in your raw text file. If you notice that insheet imports fewer than 3 rows, you can search for and replace any quotes (") in the raw file with your favourite text editor before importing.

 

2) WINDOWS USERS

Before using insheet, always type

shell find /v /c "&randomtext&*" nameofyourfile.txt & pause

This will open a command shell and count the number of rows in your file that do not contain "&randomtext&", i.e. usually every row (feel free to make the string more complex if you want!). On large files, this may take a few seconds, so be patient.

As in the Mac case, the number you will see next to the filename will be the real number of rows.
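If you would rather use the same check on every platform, the row count and the quote removal can also be done in a few lines of Python. This is only a sketch under some assumptions: the function and file names are made up for illustration, the file is read as bytes (so it works for ASCII-compatible encodings such as UTF-8 or Latin-1, but not UTF-16), and removing the quotes changes your strings in the same way a search and replace in a text editor would.

```python
import os
import tempfile


def count_rows(path):
    """Count the lines in a file, including a last line with no newline."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def strip_double_quotes(source_path, target_path):
    """Copy source_path to target_path without any '"'; return rows written."""
    rows = 0
    with open(source_path, "rb") as source, open(target_path, "wb") as target:
        for line in source:
            target.write(line.replace(b'"', b""))
            rows += 1
    return rows


# Demo on a throwaway copy of the three example rows (tabs between columns).
with tempfile.TemporaryDirectory() as folder:
    raw = os.path.join(folder, "test.txt")             # hypothetical file name
    clean = os.path.join(folder, "test_noquotes.txt")  # hypothetical file name
    with open(raw, "wb") as handle:
        handle.write(
            b'1\tTesting 123\tThis is a "string" of text\n'
            b'2\tTesting 456\tThis is another "string of text\n'
            b'3\tTesting 789\tOne more "string" of text\n'
        )
    print(count_rows(raw), "rows in the raw file")
    print(strip_double_quotes(raw, clean), "rows written without quotes")
```

Compare the first number with the number of observations that insheet reports; if they differ, run insheet on the copy without quotes instead. Note that wc -l counts newline characters, so it can come out one lower than this count when the last line of the file does not end with a newline.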

 

Enjoy!

Christian 

 

The Next 36 first cohort: Tradyo wins Best Venture Award


 

From Tradyo.com: "We have cool stuff; you have cool stuff; everyone has cool stuff. The problem is, half of our stuff goes unused when it could be super valuable to someone else. Tradyo enables people to buy and barter the things they don't use with their neighbours in a simple, convenient, and downright fun way. Tradyo uses the GPS function on your smartphone to reveal the cool stuff available around you. The app is curiosity driven - who knows what kind of treasure you'll stumble upon?"

 http://www.tradyo.com 

 

The Geography of Crowdfunding



 

Ajay K. Agrawal, Christian Catalini, Avi Goldfarb

NBER Working Paper No. 16820
Issued in February 2011
NBER Program(s):   PR 

Perhaps the most striking feature of "crowdfunding" is the broad geographic dispersion of investors in small, early-stage projects. This contrasts with existing theories that predict entrepreneurs and investors will be co-located due to distance-sensitive costs. We examine a crowdfunding setting that connects artist-entrepreneurs with investors over the internet for financing musical projects. The average distance between artists and investors is about 3,000 miles, suggesting a reduced role for spatial proximity. Still, distance does play a role. Within a single round of financing, local investors invest relatively early, and they appear less responsive to decisions by other investors. We show this geography effect is driven by investors who likely have a personal connection with the artist-entrepreneur ("family and friends"). Although the online platform seems to eliminate most distance-related economic frictions such as monitoring progress, providing input, and gathering information, it does not eliminate social-related frictions. 

http://www.nber.org/papers/w16820 

 

Our crowd-funding paper wins summer grant from NET institute


The working paper on the geography of crowd-funding that I've been working on with Ajay Agrawal and Avi Goldfarb received a summer grant from the NET institute.

 About the NET institute: "The Networks, Electronic Commerce and Telecommunications ("NET") Institute is a non-profit institution devoted to research on network industries, electronic commerce, telecommunications, the Internet, "virtual networks" comprised of computers that share the same technical standard or operating system, and on network issues in general. Of particular interest is research on innovation and introduction of new technology in network industries. The NET Institute functions as a world-wide focal point for research and open exchange and dissemination of ideas in these areas. The NET Institute competitively funds cutting edge research projects in these areas of research. It organizes conferences and seminars on these issues." (Source: http://www.netinst.org/)

 

 

Does Distance Matter in Online Entrepreneurial Finance? Evidence from Crowd-Funding in the Arts


Ajay Agrawal, Christian Catalini, Avi Goldfarb

Abstract

The most striking feature of “crowd-funding” for early stage entrepreneurial projects is the broad geographic dispersion of investors. This stands in stark contrast to existing theories that predict entrepreneurs and investors will be co-located due to distance-sensitive costs. We examine a crowd-funding setting that connects artist-entrepreneurs to investors over the internet for financing early stage musical projects where the average distance between entrepreneur and investor is about 3,000 miles, suggesting a reduced role for spatial proximity. Still, distance does play a role. Local investors are more likely to invest in the very early stages of a single round of financing and are less responsive to decisions by other investors. We show this geography effect is driven by investors who likely have a personal connection with the entrepreneur (“family and friends”). Although the online market platform eliminates most distance-related economic frictions such as monitoring progress, providing input, and gathering information (e.g., local reputation, stage presence), it does not eliminate social-related frictions such as information more likely to be held by personally-connected individuals (e.g., entrepreneur’s tendency to persevere, recover from setbacks, succeed in other endeavors).

Download Working Paper from SSRN 

Agrawal, Ajay, Catalini, Christian and Goldfarb, Avi, Does Distance Matter in Online Entrepreneurial Finance? Evidence from Crowd-Funding in the Arts (October 29, 2010). NET Institute Working Paper No. 10-08. Available at SSRN: http://ssrn.com/abstract=1692661

 

Intellectual Property Disclosure in Open Standards Development


Timothy Simcoe and Christian Catalini

Firms typically want to know whether a technology is covered by Intellectual Property (IP) rights before making it an industry standard. To promote transparency, Standard Setting Organizations require participants to disclose their IP during technical deliberations. We study the effectiveness of these policies. Specifically, we examine a large sample of IP disclosures and find that these declarations are often not very informative. The majority of disclosure statements do not list any specific piece of IP, or offer information on pricing beyond a commitment to license on “reasonable and non-discriminatory” terms. We also link the disclosure data to administrative records from the Internet Engineering Task Force, and find that unless there is a commitment to royalty-free licensing, disclosures reduce the probability that a proposal becomes a standard. Thus, while many firms remain reluctant to reveal IP, under the right conditions disclosure policies seem able to promote ex ante technological competition within SSOs.  

 (link to slides) 

 

Tracing the links between science and technology: An exploratory analysis of scientists’ and inventors’ networks


The paper provides an exploratory analysis of the research networks linking scientists working in an open science environment, and researchers involved in the private technology domain. The study combines data on scientific co-authorship with data on patent co-invention, at the level of individual researchers, for three science-intensive technology fields, i.e. lasers, semiconductors and biotechnology, in order to assess the extent of the overlap between the two communities and to identify the role of key individuals in the process of knowledge transfer. Our findings reveal that the extent of the connectedness among scientists and inventors is rather large, and that particular individuals, i.e. authors-inventors, who act as gatekeepers and bridge the boundaries between the two domains, are fundamental to ensuring this connectivity. These individuals tend to occupy prominent positions in the scientific and the technological networks. However, our results also show maintaining a very central position in the scientific network may come at the expense of being able to fill a similarly central position in a technological network (and vice versa). Finally, preliminary analysis of the institutional origins of authors-inventors shows that one characteristic, distinctive of Europe compared to the United States, is associated with the relatively lower involvement of corporate scientists at the intersection between the two worlds of science and technology.

Download here: http://dx.doi.org/10.1016/j.respol.2009.11.004
WP version available under downloads.

 

SSO patents and disclosures database


This page links to a database containing information from the combined IPR disclosure statements made at several major Standard Setting Organizations. Please let us know if you find a new use for this information. If you have questions, or would like to help us collect more data on IPRs linked to industry standards, please contact either Tim or Christian using the links on the website.

http://www.ssopatents.org

A first draft of the paper behind the dataset will be presented by Tim at the AoM annual meeting in Chicago (TIM, BPS) on Monday, August 10th from 8:00-9:30am:

http://program.aomonline.org/2009/submission.asp?mode=showsession&SessionID=608

 

Markets Making Music - Sellaband and Angie Arsenault

 
This event was organized by the Martin Prosperity Institute's Program on Innovation and Creative Industries. It was hosted at the Rotman School of Management on February 10, 2009.
 
 

About

Christian Catalini

PhD Candidate in Strategy at the Rotman School of Management and technology enthusiast. I wrote my undergraduate degree thesis on the economics of open source development and my MSc final dissertation on "The link between science and technology: exploring the network of inventors and scientific authors in the semiconductor industry". After working at KITES-CESPRI Bocconi on the European research project “Highly cited patent”, I started my PhD in Strategic Management at Rotman. Current projects include "Markets Making Music", with Ajay Agrawal; "Intellectual Property and the Diffusion of Formal Standards", with Timothy Simcoe; and "Authors-inventors: life on the boundary between science and technology", with Stefano Breschi.

Areas of interest: economics of innovation, the market for ideas, knowledge flows between science and technology, open source, distributed innovation, creative industries, entrepreneurship.

