
IMPROVED UTILIZATION AND RETRIEVAL OF ANECDOTAL INFORMATION

Douglas A. Samuelson, Industrial and Systems Engineering Program, Memphis State University


KEY WORDS: Text retrieval, data base management systems, human rights, survey research

(Republished with permission as presented at the American Statistical Association Annual Meeting, August 1992.)

_____

Abstract


         Anecdotal information is more complete and potentially more informative than the quantitative data and coded summaries which statisticians and social scientists customarily use; but absorbing the information is time-consuming.  As numbers and codes are easier to process, compare, summarize and analyze,  natural-language information is often neglected.


     We discuss recent advances in computer text retrieval systems, which make it possible to organize, summarize and analyze large numbers of natural-language observations.  We illustrate with examples of medical case histories, transportation accidents, and human rights violation reports.


1. Introduction


     To summarize and compare large numbers of observations, statisticians and social scientists frequently use statistical summaries and graphical presentations to make the information in the data usable.  These approaches require that the information be stored and retrieved in relatively inflexible ways, such as standardized coding or relational data bases with predefined fields.  This requirement, in turn, can limit analysis and cause useful information to be overlooked.


     Anecdotal, natural-language (also called "free text") reports and comments contain more information than the coded forms, but the non-standard form of such comments makes them impossible to use as part of typical quantitative analyses.  In some cases, such as human rights reporting, the low level of quantitative training prevalent among practitioners in the field has posed a major impediment to large-scale and comparative analyses.


     The inflexibility of quantitative approaches has promoted counter-productive divisions between "quantitative" and "non-quantitative" approaches in the social sciences: some "quantitative" analysts of the author's acquaintance tend to dismiss thorough, careful, informative descriptions as "merely anecdotal," while some "non-quantitative" analysts deride statistical studies as one-dimensional and incomplete.  For some "quantitative" partisans, "anecdotal" seems to be nearly synonymous with "not scientifically rigorous."  


     The problem is not that natural-language information is less valuable than numbers, but that it is more difficult to "map" natural-language information into usable forms for quantitative analysis.  A thorough natural-language description, based on careful observation, often contains much quantitative information: the problem is how to extract it and how to evaluate it.  If we are told, for example, that a man walked across a large lake, we may suggest that he could have used some sort of illusion.  If, in addition, we learn that several unbiased, reliable witnesses checked carefully and found no hidden supports under the water, we are somewhat more willing to believe that a highly improbable event occurred, even though no one took quantitative measurements.  If we have videotapes of the event as well, we may be even more ready to believe that the event really happened as described.  In each instance, we can draw on context and on other knowledge to support quantitative conclusions about the strength of the evidence.


     In this example, what gives us trouble is developing more precise estimates of quantities: how improbable is a man walking on water?  How improbable is an illusion which would escape detection by whatever means the observers used?  If a number of similar incidents are reported, how likely is it that there is a pattern of common causes?   Can we conclude which of a number of hypothesized patterns is most likely?   If the evidence is credible and the matter is of interest, discarding the entire event as "merely anecdotal" is not satisfactory: we can and should know more.  To do so requires extracting some quantitative information from the reports we have.  The purpose of this paper is to show how recent advances in technology can help to re-integrate anecdotal and quantitative information.



2.  Text-Retrieval Software


     In the past twenty years, many files of statistics have been stored in computers and, eventually, incorporated into data-base management systems (DBMS's) to facilitate selective retrieval.  Most data base management systems include "comment" or "memo" fields in which free text can be entered.  Classical methods of retrieval, however, are based on predefined fixed-format key fields.  If a certain type of comment appears frequently and the analyst would like to retrieve all records with this type of comment (or a subset based on additional criteria), he must define a new key field which indicates whether the desired type of comment is present in the record's "memo" fields.  He must then manually code this new key field record by record, based on examination of each record's "memo" fields.


     Recent advances in computer data-retrieval software offer great potential for improvement in using non-quantitative data in quantitative analyses.   Many analysts are now supplementing or replacing rigid-format data base systems by data base management systems which allow the mixing of rigid formats and free-text material.  These systems include powerful search algorithms which can, for example, retrieve all documents which contain a specified phrase, regardless of where that phrase appears (in the formatted fields or in free text).  Imaginative use of such retrieval systems can greatly simplify the task of translating field observations into useful data for quantitative studies.  


     We consider a small-scale demonstration of the use of one such storage and retrieval system, askSam, in the quantitative analysis of anecdotal information about general aviation accidents.  We then discuss early work in progress for two other applications: medical case histories and human rights violation reports.  Finally, we comment more generally regarding the potential use of  askSam or similar systems in reporting and analysis of events which have traditionally been difficult to analyze quantitatively because of the non-quantitative or non-standardized nature of the reported information.  
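The kind of search described in this section, retrieving every record that contains a specified phrase whether it sits in a formatted field or in free text, can be sketched in a few lines of Python.  This is a minimal illustration only, not askSam's implementation or interface: the Record class, the function names and the three sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One record mixing fixed-format fields with a free-text memo (hypothetical layout)."""
    fixed: dict = field(default_factory=dict)  # predefined key fields
    memo: str = ""                             # free-text comment


def _normalize(text):
    # Lower-case and collapse runs of whitespace so matching is forgiving.
    return " ".join(str(text).lower().split())


def contains_phrase(record, phrase):
    """True if the phrase occurs in any fixed field or in the memo."""
    needle = _normalize(phrase)
    parts = list(record.fixed.values()) + [record.memo]
    return any(needle in _normalize(part) for part in parts)


def search(records, phrase):
    """Return every record containing the phrase, wherever it appears."""
    return [r for r in records if contains_phrase(r, phrase)]


# Made-up records, for illustration only.
RECORDS = [
    Record({"id": "001", "category": "fuel system"},
           "Routine inspection, nothing noted."),
    Record({"id": "002", "category": "landing gear"},
           "Pilot reported a fuel   system fault before landing."),
    Record({"id": "003", "category": "weather"},
           "Icing encountered at altitude."),
]

matches = search(RECORDS, "Fuel System")
print([r.fixed["id"] for r in matches])
```

Record 001 matches through a coded field and record 002 through its memo, so no new key field has to be defined or coded by hand before the question can be asked.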


3.  General Aviation Accidents


     The Air Safety Foundation (ASF), a division of the Aircraft Owners and Pilots Association (AOPA), provides information and training for general aviation pilots nationwide to improve the safety of general aviation.  The ASF relies on various sources of information about general aviation accidents to identify causes and contributing factors and to develop recommendations.  One of the most important of these sources is the computer file maintained, in a conventional, predominantly fixed-field data base format, by the National Transportation Safety Board (NTSB), listing all (U. S.) domestic general aviation accidents and summarizing the NTSB's investigative findings concerning the accidents.


     The reports are indexed by key words which include one primary and up to three secondary cause codes.  The accident investigative team selects these codes from a standard list maintained by NTSB.   The report also identifies, in standard codes, such items as the weather conditions, the aircraft type, and the phase of flight in which the accident occurred.


     These codes, however, can partially obscure or over-simplify the circumstances of an accident.  If, for example, a pilot fails to negotiate a powerless landing, the crash will be attributed to that failure and the "approach" or "landing" phase of the flight will be listed as the phase in which the accident occurred.  Additional detail, however, may reveal that the pilot was attempting a powerless landing because of a fuel system failure during an earlier portion of the flight.  Such secondary causes are not always evident from examining only the cause codes and the reported phase of flight in which the problem occurred.


     The report records also contain free-form "comment" or "memo" fields in which the investigators can write a few lines, in plain English (free text), about the accident.  Here the reporters include elements such as the fuel system failure which forces an attempted powerless landing.  Typical data base programs would make these comments available, record by record, for the analyst to examine.  If, however, the analyst wanted to count how many powerless landings involved fuel system failures on a particular aircraft type, standard data base systems offer no help.


     I therefore recommended to the ASF, in late 1989, that they consider the potential use of a text retrieval system to extract additional information from the free text fields of the general aviation accident data base they already have and maintain.  They arranged to have me perform a small-scale demonstration, making a few inquiries on one year's data.  One of the patterns I found was the one cited above as an example: a number of powerless-landing crashes which, upon close examination of the comments, turned out to be related to earlier fuel exhaustion.  I also found that, for almost all of these crashes, the fuel exhaustion problem appeared on approach, in cold weather, with a single aircraft type.



     This example made the demonstration a big success, as it transpired that the Air Safety Foundation already knew about this pattern.  They pointed out an additional factor I had missed: most of the pilots generally flew a number of different types of aircraft.  The explanation was that the aircraft in which these accidents had occurred had the fuel line shutoff valve where other types had the cabin heater control.  Pilots accustomed to different aircraft had, therefore, inadvertently shut off their fuel while trying to lower the cabin temperature slightly so that they would remain alert for landing.  The problem pattern had been discovered when an experienced AOPA pilot incurred the fuel problem on approach, immediately reversed the last thing he had done (good pilot training and instincts), and then reported the incident to his colleagues in AOPA and ASF.

       

     The point is that, using the text retrieval system, a statistician with limited knowledge of general aviation found a pattern that had eluded many pilots and investigators.  Some experts in general aviation had eventually worked out the same pattern, but the demonstration showed that, using the text-retrieval approach, an analyst with much less aviation knowledge could find the same pattern much more quickly.
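A query of the kind used in this demonstration, counting the records whose comments mention certain terms and breaking the count down by a coded field such as aircraft type, might look like the following sketch.  It does not reproduce the NTSB file layout or the actual inquiries made: the field names, the tally function and all five records are invented for illustration.

```python
from collections import Counter

# Entirely made-up accident records; field names are assumptions.
ACCIDENTS = [
    {"aircraft_type": "Type A", "phase": "approach",
     "memo": "Engine lost power on approach; fuel selector found in OFF position."},
    {"aircraft_type": "Type A", "phase": "landing",
     "memo": "Forced landing after fuel exhaustion; cold weather, cabin heat in use."},
    {"aircraft_type": "Type B", "phase": "landing",
     "memo": "Hard landing in gusty crosswind."},
    {"aircraft_type": "Type B", "phase": "cruise",
     "memo": "Fuel contamination suspected; precautionary landing."},
    {"aircraft_type": "Type A", "phase": "landing",
     "memo": "Nose gear collapsed on rollout."},
]


def memo_mentions(record, terms):
    """True if every term appears somewhere in the record's free-text memo."""
    text = record["memo"].lower()
    return all(term.lower() in text for term in terms)


def tally(records, terms, by):
    """Count records whose memo mentions all terms, grouped by a coded field."""
    hits = [r for r in records if memo_mentions(r, terms)]
    return Counter(r[by] for r in hits)


by_type = tally(ACCIDENTS, ["fuel"], by="aircraft_type")
print(dict(by_type))
```

Simple substring matching is the crudest possible choice; it is enough to show how a free-text condition and a coded field combine into a single count, which is the step that fixed-field retrieval alone cannot perform.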


4.  Medical Case Histories


     A large mid-South hospital recently installed a nursing information system, with point-of-care terminals, in a cardiovascular surgery ward.  The hospital's management-nursing team, formed to implement the system, was interested in evaluating what benefits resulted from the system.  They expected an improvement in quality and consistency of documentation, and therefore in collection of payments; they also hoped for an improvement in nurse-patient interaction because of reduced time spent in charting and chart review.



     As part of an early step in this evaluation, the hospital's implementation team decided to review a number of charts from both before and after the system was installed.  The nurses were required to enter physical measurements, such as fluid intake and outflow, body temperature, heart rate, blood pressure; treatment orders; and general assessments of the patient's alertness and progress.  The team intended to assess the completeness of the entries and the consistency among physical measurements and nurses' observations.  


     The system software also included some text processing and retrieval capabilities, however, and the automated chart included space for nurses' notes.  The review disclosed that nurses' notes were much more easily retrieved, compared, and summarized for the charts entered through the system.  As a result, useful discussions among the nurses and their supervisors brought about further standardization of terms used in the notes, and discussions of what conditions to look for in patients.  These discussions led, in turn, to significant enhancements of the entry screens for the fixed-field data elements.


     The discussions and self-examination also led the team to realize that nurses' notes manually entered were rarely reviewed, and that such reviews usually consisted of someone (the attending physician or a nursing supervisor, for example) extensively interviewing the nurse who had made the notes.  The quality review teams were examining patterns in the nurses' notes for the first time, now that computer-assisted summaries and comparisons were feasible.


     This is highly preliminary work, but the hospital's management and nurses continue to refine both the computer system and their procedures.  They have found the system's capability for collecting and retrieving nurses' notes very useful in reviewing and improving their procedures and processes.


5.  Human Rights Violations


     One of the most frustrating and difficult aspects of advocacy of human rights is documenting whether a human rights violation has occurred.  Generally, meaningful intervention can take place only after a number of governments and international agencies officially agree that actionable offenses have taken place.  Such agreement usually requires factual evidence and often involves statistical issues: how strong is the evidence that a violation occurred?  Is there a pattern of similar events?  What is the magnitude of the events?


     Obviously, such events do not take place under careful observation by unbiased, statistically trained observers.  Those who do observe, even if they do so carefully and objectively, are unlikely to have much understanding of the kinds of questions statisticians raise about evidence concerning such events.  This disparity between observers' collection of data and decision-makers' preferred presentation of data constitutes one of the most important "barriers to belief"; breaking through such barriers can be at least as important and as difficult as ascertaining the truth. (Frelick, 1989)


     An instance of this difficulty is the experience of a team of physicians who traveled to some refugee camps in Turkey in October, 1988, under the auspices of Physicians for Human Rights, to examine and interview a number of Kurdish refugees from Iraq.  The team conducted physical examinations and administered, through interpreters, a structured interview concerning the circumstances in which the people's injuries had occurred.  Many of the refugees had burns on their hands, arms and faces; the physicians concluded that these burns were almost certainly chemical in nature and origin.  The refugees reported fairly consistently that they had been burned shortly after aircraft flew over their villages dropping a yellowish liquid.


     When the physicians returned to the U. S. and attempted to publish their findings, their article was at first rejected by a prominent medical journal because at least one reviewer raised concerns about the number of subjects, the absence of a control group, and other objections more appropriate to a clinical trial than to field observations.  The physicians were uncertain about how to address these concerns.


     Fortunately, their data were sufficiently detailed, precise and consistent in terminology that, with some prompting from a statistician they consulted, they were able to pose and answer such questions as: how often, under normal circumstances, would one expect to see injuries such as these in a population of desert-dwelling, agricultural nomads? (Extremely rarely, as such people ordinarily would not even be exposed to such "common, household" chemicals as bleach or lye.)  How probable is it that such injuries would occur nearly simultaneously at random to more than 100 individuals in such a population?  (Infinitesimal.)  What other events could cause such injuries?  (Chemical accidents, such as might occur in industrial areas, and deliberate attack are the only causes likely enough to be worth considering.)  Applying Bayes' Theorem, what is the posterior probability that a deliberate attack occurred?  (Considerably more likely than not.)  The revised report was accepted and published -- without the explicit use of Bayes' Theorem, I regret to say.  (Hu et al., 1989)
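The sequence of questions above can be written out in Bayes' Theorem form.  What follows is a sketch in notation chosen here for illustration, not the calculation actually performed: E stands for the observed evidence (consistent chemical burns in more than 100 people at nearly the same time), and A, C and R stand for deliberate attack, chemical accident, and ordinary background occurrence.  The source gives no numerical priors or likelihoods, so none are assumed.

```latex
\[
P(A \mid E) \;=\;
  \frac{P(E \mid A)\,P(A)}
       {P(E \mid A)\,P(A) + P(E \mid C)\,P(C) + P(E \mid R)\,P(R)}
\]
% If P(E | R) is negligible, the background term drops out:
\[
P(A \mid E) \;\approx\;
  \frac{P(E \mid A)\,P(A)}
       {P(E \mid A)\,P(A) + P(E \mid C)\,P(C)}
\]
```

Once the background term is treated as negligible, the posterior probability of attack exceeds one half exactly when P(E | A) P(A) is larger than P(E | C) P(C), so the argument reduces to comparing the attack and accident explanations.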


     This episode illustrates the value of summarizing non-quantitative information in light of quantitative questions.  What made the convincing summary possible in this case was the careful, thorough and consistent examination of the victims.  While the physicians had limited quantitative training, they were able, with the statistician's guidance,  to conclude that certain causes were extremely unlikely and others were quite likely.  The statistician, with a little creativity, could supply the rest of a solid quantitative analysis.


     We must note that much useful work is being done by others in improving the collection, organization and utilization of information regarding human rights violations.  Recent work on the Human Rights Information and Documentation System (HURIDOCS) in developing standard reporting formats is an excellent example of how to make anecdotal reports much easier to compare and summarize, thereby increasing their value to quantitative analysts.  (Dueck, 1992)   In 1986, human rights organizations used computers and data bases very effectively, utilizing an earlier standard set of reporting formats, in helping to identify and bring to justice, under a short time deadline, over 1000 of the perpetrators from the "dirty war" in Argentina.  (Dueck, 1992, pp. 130-131)


6.  Implications and Recommendations

     

     It is no coincidence that all the examples cited here involved observers trained in a discipline (transportation accident investigation, nursing or medicine) with a standard terminology, significant consensus about the description and classification of events, and an emphasis on careful, thorough reporting.  If, as it seems from the Physicians for Human Rights example, it is easier for statisticians to learn to use anecdotal data more effectively than for non-statisticians to become comfortable with statistical methods, the most promising strategy for collaboration is to promote good reporting and standard language and then, using the methods and tools outlined here, to extract the statistics from those reports.  I recommend this approach for occasions when non-quantitative people, such as workers in non-governmental organizations which promote human rights, request training and analytical assistance from the statistical  profession.


     There are also significant implications for more conventional survey research, in the social sciences and elsewhere.  A well-known problem of survey research is dealing with exception responses: added response categories on multiple-choice questions, comments which indicate confusion about terms, and more general remarks.  These responses are important both at the pretesting stage, when they suggest modifications of the instrument, and in the analysis stage, when they may provide important information not captured by the coded responses.


     In some preliminary work some colleagues and I have recently conducted, using a text-retrieval system to enter, review and analyze pretest results has proven more efficient and useful than typical reviews.  The ability to ascertain quickly  how often an added response to a question occurs, or how many respondents commented on a question in some way, has helped us to focus readily on the most important needs for modification.  We recommend the use of text-retrieval systems to speed analysis of pretest results.
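The two counts mentioned above, how often an added response to a question occurs and how many respondents commented on a question, can be sketched as follows.  This is an assumed layout rather than the one used in the pretest work: the tuple format, the question labels and every entry are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical pretest entries: (respondent id, question, kind, text).
PRETEST = [
    (1, "Q3", "added_response", "Self-employed"),
    (2, "Q3", "added_response", "self-employed "),
    (3, "Q3", "comment", "What does 'household' mean here?"),
    (3, "Q7", "comment", "Too long."),
    (4, "Q3", "added_response", "Retired"),
]


def added_response_counts(entries):
    """For each question, count each written-in response (case-insensitive)."""
    counts = defaultdict(Counter)
    for _, question, kind, text in entries:
        if kind == "added_response":
            counts[question][text.strip().lower()] += 1
    return counts


def respondents_commenting(entries):
    """For each question, count the distinct respondents who left a comment."""
    who = defaultdict(set)
    for respondent, question, kind, _ in entries:
        if kind == "comment":
            who[question].add(respondent)
    return {question: len(ids) for question, ids in who.items()}


print(added_response_counts(PRETEST)["Q3"].most_common())
print(respondents_commenting(PRETEST))
```

A written-in category that recurs, as "self-employed" does in this invented data, is a candidate for a new response option; a question that draws comments from many respondents is a candidate for rewording.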


     In addition, for the final instrument, we can use more open-ended questions without loss of quantitative precision or detail.  For example, we can ask, "In what religious denominations have you participated?", with room for written responses, rather than "Circle all the names of religious denominations in which you have participated" with a grid of 50 names and a high probability of offending some respondents by omitting their favorite.  We have also found that the less formal instrument encourages free-form responses.  Some of these contain unusual information which we would have been very unlikely to obtain via a more conventional instrument.  We therefore recommend that survey researchers experiment with less structured items, relying on text-retrieval software to help summarize and analyze the responses.


     In general, we see that computer-based text-retrieval systems can greatly facilitate the use of non-quantitative information in quantitative summaries and comparisons.  To utilize such systems to their potential, however, observers should:

   -- record their data as carefully and precisely as possible;

   -- use standard terms, and use terminology consistently; and

   -- give the statistically trained reader as much help as possible in deciding whether the events described would be likely or unlikely to result from various causes.


     We look forward to learning of other investigators' experiences in unusual applications of text-retrieval systems, such as those described here, and in enhancing more conventional surveys. In both types of application, the creative use of text-retrieval systems to compare and summarize anecdotal responses offers great promise for reducing the division between "primarily quantitative" and "primarily non-quantitative" social science.



References


Dueck, Judith, "HURIDOCS Standard Formats as a Tool in the Documentation of Human Rights Violations," in Jabine, Thomas J. and Claude, Richard P., eds., Human Rights and Statistics: Getting the Record Straight, University of Pennsylvania Press, 1992.

Frelick, Bill, "Refugees: Contemporary Witnesses to Genocide," presented at the Genocide Watch Conference, Institute for the Study of Genocide, John Jay College of Criminal Justice, New York, May 22, 1989.

Hu, Howard, Cook-Deegan, Robert, and Shukri, Asfandiar, "The Use of Chemical Weapons: Conducting an Investigation Using Survey Epidemiology," Journal of the American Medical Association, Vol. 262, No. 5, Aug. 4, 1989, pp. 640-643.



Acknowledgement


The author gratefully acknowledges Herb Spirer for numerous helpful comments and suggestions.
