Traduzione e Interpretazione

Transcription tools

I have been testing transcription tools for years, and the first thing I wrote on this topic dates back to more than one year ago (see the post “Useful Software for Transcription” or download the poster I presented at the Modena International Workshop from the page My Writings).

Testing provides food for thought, and my thoughts gradually took the form of a paper on the question of transcription, which will hopefully be published by the end of my PhD.

This post is a technical complement to that mainly theoretical paper, as it presents (with descriptions and screenshots that would hardly have fitted in an academic article) the tools I have personally tested in order to find the best solution for my transcription needs. It contrasts transcriptions of the same oral data made from scratch using Transana, XTrans, SpeechIndexer and Exmaralda, then hints at two other tools you may use during the transcription process, namely Dragon and Xaira. As such, it will hopefully be of some use to those of you who are still looking for software fulfilling your specific needs, and who may benefit from knowing multiple tools before finally settling on your own.

Depending on the kind of analysis one wishes to carry out, some tools may be more suitable than others. But since, at first glance, all of them look promising and offer a variety of useful features, it can be difficult for a researcher wishing to use such tools to determine whether a particular one is suitable for her or his data, research question or available computer.

To decide about usefulness and usability, it is necessary to know about the ease of use, strengths/weaknesses for specific annotation purposes, and the type of data or analysis the tool is designed for – knowledge that is usually gained only after becoming an expert in the use of a particular tool (Rohlfing et al., 2006: 99).

The goal of the workshop the above quotation comes from, and of the present post, is thus to present information about and demonstrations of four of these tools, and to share the views I developed during the months I spent becoming an “expert” in one or the other. The rationale for choosing these particular four for comparison is that they all display the waveform, thus allowing for an exact quantification of pauses, and they all align transcript and audio, thus enabling the transcriber to immediately listen to the oral features that are so difficult to write down.

Tools will be presented with respect to their main idea, distinctive features, usability and possible drawbacks, and accompanied by a screenshot of what the transcript looks like in the tool itself. A summarizing table will finally be provided, where basic information and features will be displayed and compared.

Transana (Version 2.05) is a software package for professional researchers who want to analyse digital video or audio data, developed at the University of Wisconsin.

It directly supports Jeffersonian Transcription Notation with icons and shortcuts in both the transcription and the visualization windows. Another distinctive feature is that Transana enables you to align text and audio, but this can only be done manually and unsystematically (i.e. by placing the cursor at the exact point, both in the audio and in the transcript, where one wants to add a time code). When you export your transcript, these multimedia time codes are included in an .rtf file, which may require some further work in order to obtain the .txt file that is usually uploaded into interrogation and concordance tools (such as Wordsmith, ParaConc or AntConc).
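As a purely hypothetical illustration of that further work, a few lines of script can strip time codes from a plain-text version of such an export. The bracketed [hh:mm:ss.ms] pattern below is my own assumption for the sake of the example, not Transana’s actual marker; the regular expression would have to be adapted to whatever your tool really writes out:

```python
import re

def strip_time_codes(text):
    """Remove hypothetical [hh:mm:ss.ms] time codes from a transcript line.

    The bracketed pattern is illustrative only; adapt the regex to the
    marker your transcription tool actually exports.
    """
    return re.sub(r"\[\d{2}:\d{2}:\d{2}(?:\.\d+)?\]\s?", "", text)

line = "DOC: [00:01:23.45]bonjour, asseyez-vous [00:01:25.10]"
print(strip_time_codes(line).strip())  # prints: DOC: bonjour, asseyez-vous
```

The cleaned output can then be saved as the .txt file that concordance tools expect.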

The interface is very user-friendly, and Transana is definitely the easiest transcription software I have ever tried. It can be learned in minutes, and online guided tours and demo videos are also provided.

The main drawback, alongside the fact that the latest versions of the software are not free, is the other side of the Jeffersonian coin, namely the difficulty of turning the transcript into something machine-readable, and hence analysable with query software (at least in the version I tested).

Fig. 1 What Transana looks like

This screenshot shows you what Transana looks like. The file structure is shown on the right, the audio waveform is at the top, and the transcript is in the middle. Pauses, overlaps, events, lengthening, emphasis, inhalation, exhalation and tempo are annotated following Jeffersonian Transcription Notation. The highlighted segment shows you, in particular, how overlaps between two participants speaking at the same time are treated. The same overlap will be similarly highlighted in the other tools, so as to underline their differences with respect to this important feature.

XTrans (Version 1.1 for Windows) is a multi-platform, multilingual, multi-channel transcription tool developed by the Linguistic Data Consortium (LDC) to support manual transcription and annotation of audio recordings.

It provides solutions to the usual transcription challenges: it eases the transcribing activity and it can link audio with transcript. Unlike Transana, this is done automatically and consistently, but the time codes are once again incorporated into the exported file. Although it is a .txt, the exported file includes multimedia tags and is thus not ready for use in interrogation or concordance tools; it too requires some further work before transcript and audio can be analysed.

It is divided into five main areas:

1) the segment panel, on the left, displays three kinds of segments in three separate columns: the left-most column shows story boundaries, the middle one shows speaker segments, and the right-most shows sentence units (statements, questions, incomplete, backchannel or unassigned);
2) the transcript panel shows the transcribed text;
3) the speaker panel shows the list of all speakers assigned to segments in the file;
4) the widget panel features several shortcut buttons for speaker verification and other quality checks;
5) the audio panel includes the waveform display and audio control buttons.

Fig. 2 What XTrans looks like

The above-mentioned divisions make the tool easy to use and intuitive, especially in the case of sentence units, which would allow us to visually distinguish between doctors’ questions and patients’ answers; overlaps, moreover, are usefully displayed in the waveform. The other side of the coin, however, is that overlaps are not so easily displayed in the transcript panel, where one has to find alternative ways to represent them (e.g. using Jeffersonian Transcription Notation).

This screenshot shows you what XTrans looks like. The transcript is above the waveform, where overlapping speech regions are displayed using overlapping color-coded horizontal bars, one per speaker. Annotation here follows Jeffersonian conventions, but the same could be done with TEI-XML, as the tool neither imposes nor suggests any specific system.

SpeechIndexer (Windows 7 version, last updated on 21 July 2010) is a software tool developed by Ulrike Glavitsch at the Swiss Federal Institute of Technology in Zurich (ETH), which allows semi-automatic correlation of recorded speech with its textual transcription.

Originally designed for documenting endangered Formosan Aboriginal languages (Szakos & Glavitsch, 2004), this tool can be used to segment the audio signal of any language (it automatically detects talk and silences), to transcribe it (by entering a pre-existing transcription or by doing it from scratch through a simple editor), and to index the audio, so that text segments are linked to audio segments. One may thus string-search the database according to morphemes, grammatical tags, etc., and listen to the corresponding audio.
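The general idea behind automatic talk/silence detection can be sketched, in drastically simplified form, as energy thresholding over fixed-length frames. This is only an illustration of the principle; SpeechIndexer’s actual algorithm is certainly more sophisticated, and the threshold and frame length below are arbitrary:

```python
def segment_speech(samples, threshold=0.02, frame_len=400):
    """Toy energy-based speech/silence segmentation.

    Splits a mono signal into fixed-length frames and marks a frame as
    speech when its mean absolute amplitude exceeds a threshold.
    Returns a list of (start_sample, end_sample) speech runs.
    """
    segments = []
    in_speech = False
    start = 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / frame_len
        if energy > threshold and not in_speech:
            start, in_speech = i, True
        elif energy <= threshold and in_speech:
            segments.append((start, i))
            in_speech = False
    if in_speech:
        segments.append((start, len(samples)))
    return segments

# toy signal: silence, a burst of "speech", silence
signal = [0.0] * 800 + [0.5, -0.5] * 400 + [0.0] * 800
print(segment_speech(signal))  # prints: [(800, 1600)]
```

Real systems add smoothing, adaptive thresholds and minimum-duration constraints, which is precisely why having the segmentation done for you is such an advantage.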

SpeechIndexer is quite simple to use compared to Exmaralda, and it has the huge advantage of segmenting the waveform automatically and of aligning text and audio without changing the text file, since an index file is generated to link the two without altering either of them. Transcribing from scratch is, however, not as simple as it may seem, so I strongly advise you to transcribe by other means and then upload your transcription in .txt format. Furthermore, the software is still in its infancy, which often leads to bugs and inconsistencies that may eventually be solved, but have not been yet.

Fig. 3 What SpeechIndexer looks like

This screenshot shows you what SpeechIndexer looks like. The waveform is at the top, the transcript is at the bottom, and the navigation buttons are between the two (one simply has to hover the cursor over one of these buttons to see the keyboard shortcuts). The words underlined in red and italics at the top of the editor window are linked (indexed) to their corresponding audio segment. Thanks to this function, when using the concordancing programme SpeechConcordancer, which is also provided with SpeechIndexer, one may listen to the pronunciation of the words, or morphemes, queried.

Problems arise, however, when dealing with the transcription of interactions involving more than one speaker. As you may see at the bottom of Figure 3, two utterances by two different speakers are treated as one single block of speech, and consequently indexed as if they followed one another when they in fact occur simultaneously. Although both Szakos and Glavitsch proved sensitive to this problem, and stated they would think about how to extend the tool to meet the needs of researchers dealing with interaction and dialogue interpreting, SpeechIndexer is at the moment only really suitable for transcribing and indexing monologues.

Exmaralda (Partitur-Editor 1.4.5, Coma 1.7, EXAKT 0.8) is a system of concepts, data formats and tools for the computer assisted transcription and annotation of spoken language, and for the construction and analysis of spoken language corpora, which was developed at the SFB 538 Mehrsprachigkeit of the University of Hamburg.

Exmaralda uses a time-based data model which is “very similar, if not largely identical, to the data models used by such tools as Praat, ELAN, the TASX annotator, or ANVIL” (Rohlfing et al., 2006: 109), which makes it relatively easy to exchange data to and from these systems. In addition to interoperability, the other distinctive features of Exmaralda are the attention paid to the visual representation of synchronicity in score format (see Meyer, 1998: 73-74 on the advantages of score notation), and the linkage between audio and transcript. Just like Transana and XTrans, Exmaralda includes the time codes in the transcript file, but the main difference is that Exmaralda allows for exports in multiple formats, including a .txt without time codes.

Despite what is stated in the manuals and in Rohlfing et al. (2006: 111), namely that “typical Exmaralda users are non expert computer users”, I found even its basic functions quite complex to use. The video tutorial available in German eases the learning process, but familiarization with the system’s principles requires much more time compared to the other tools I took into consideration.

Fig. 4 What the keyboard added to Exmaralda looks like

However, the two scholars who are actually in charge of software development and dissemination, Thomas Schmidt and Bernd Meyer, are extremely helpful to those who want to use Exmaralda. In my specific case, Schmidt extended the tool by adding my mark-up to the keyboard, so that I simply had to click one button to insert features such as pauses, intonation patterns or non-verbal behaviour into the partitur editor.

Furthermore, Schmidt not only made me dream of a possible integration between time-based and hierarchical data models (see Schmidt, 2005), but he also made this dream come true (with all the bugs and limits of the case) by adding to the tool a special export filter called Modena TEI (on which more information is to come in a few days).

The main drawback of Exmaralda is its complexity, which may hinder its use by many researchers. The fact remains, however, that Exmaralda was used to build the only corpus of Public Service/Community interpreter-mediated interactions that is, to my and Zanettin’s (2009) knowledge, so far available on the internet (the DiK corpus on interpreting in hospitals can be accessed through a password given upon request), which makes it impossible not to take it into consideration. Furthermore, time alignment is done consistently during the transcription process by using the “append interval” button, and the opportunity to export the transcript in many different formats enables the transcriber to use it for many different purposes and with many different interrogation and concordance tools.

Fig. 5 What Exmaralda looks like

This screenshot shows you what EXMARaLDA looks like. The waveform is at the top, the transcript is in the middle, and what you see at the bottom is a keyboard enabling one to add symbols. My annotation here follows Schmidt’s (2005) conventions for conversion between the Exmaralda Basic-Transcription format and TEI format. Overlaps are visually and intuitively displayed, and additional information such as tempo (getting faster) and laughter (ride, ‘laughs’) is provided in separate tiers.

Before bringing this overview to a close, a final set of considerations arises from two other pieces of software, Dragon and Xaira, which I also happened to consider.

Among the decisions to make at the design stage of a project involving transcription, one is whether to use VR (Voice Recognition) software to ease and speed up the transcription procedures. Dragon is widely regarded as the best voice recognition software available. It works well if properly trained, but it may have problems managing the continuous and quick switching from one language to another that is typical of community interpreting. To make matters worse, it unfortunately deals with only two languages at a time, either Italian and English or Italian and French, which would have required purchasing two copies of the software for my project, which initially involved all three languages. In view of these drawbacks, I finally decided not to use Dragon.

Having excluded Dragon, the only thing I was sure of from the outset of my PhD project was that I might use Xaira (Dodd, 2009) to analyse my data. Xaira is a general-purpose XML search engine which operates on any corpus of well-formed XML documents, and which is best used with TEI-conformant documents. Through the use of a CSS stylesheet, the “apparent dichotomy between machine-friendly and reader-friendly formats” (Cencini & Aston, 2002: 57) can be resolved, and one can end up with easily and intuitively readable displays.

Fig. 6 What Xaira looks like

This screenshot shows you what Xaira can look like when using a stylesheet, namely a set of rules that determine how the information encoded in the XML mark-up is translated into a visual display. Stylesheets allow one to match computer readability with user-friendly visualization, where an XML string such as <prol>docteur</prol> can be turned into something Jefferson-like (docteur:).
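A minimal sketch of what such a tag-to-display mapping does can be given in a few lines of Python. The tag <prol> is taken from the example above; the <pause> tag and the exact renderings are my own hypothetical additions, standing in for whatever a real stylesheet would define:

```python
import xml.etree.ElementTree as ET

def render(el):
    """Map one (hypothetical) transcription tag to a Jefferson-like string."""
    if el.tag == "prol":           # prolonged sound -> trailing colon
        return (el.text or "") + ":"
    if el.tag == "pause":          # micro-pause -> (.)
        return "(.)"
    return el.text or ""

def jefferson(xml_string):
    """Turn a machine-readable utterance into a reader-friendly line."""
    root = ET.fromstring(xml_string)
    parts = [root.text or ""]
    for child in root:
        parts.append(render(child))
        parts.append(child.tail or "")   # text following the child element
    return "".join(parts)

u = '<u>euh <prol>docteur</prol> bonjour</u>'
print(jefferson(u))  # prints: euh docteur: bonjour
```

A real CSS or XSLT stylesheet works on the same principle, except that the mapping is declared once and applied by the browser or search engine rather than by ad hoc code.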

Given that there appears to be no good transcription tool that supports TEI natively, I needed software which could help me with the transcribing activity and which would also allow me to export and easily convert transcripts into TEI-XML. That is what the next post is all about.

Table 1 Comparing 4 transcription tools

Before you have a look at the summarizing table, where Yes and No speak more than a thousand words, let me remind you that each transcriber must weigh these different and often conflicting elements for herself or himself, in the context of the specific goals of their particular transcripts.

Having already laid out some of the tools’ benefits and shortcomings, I conclude by advocating that, whichever tools are used, transcribers must pay attention to the features that may prove useful in the context of their specific research objectives. And since I share Bucholtz’s belief that “a truly reflexive transcription practice will involve a discussion both of the choices we make and of their limitations” (2000: 1462), I briefly summarize the decisions I have made so far for the purpose of my project on community interpreting.

To convince others and myself that Exmaralda may be the best possible solution, I would recall a few advantages you may also infer from Table 1 above. Exmaralda eases the transcribing activity (once you get your teeth into it, which may take a long time compared to other tools), links audio with transcript, allows for the transcription of words spoken, prosodic elements, pauses and events, lets you add meta-information on speakers, activity, languages and code-switching, provides a nice visualization of overlaps, addresses the issue of data sharing, and above all allows for multiple exports in different formats, so that one is not bound to the programme’s default visualization (called partitur) and query device (called EXAKT).

Transcriptions can be output in .rtf, so as to visualize the staves, along with basic information on the project and the speakers, on computers which do not have the programme installed. They can be output in .html for data sharing over the internet, allowing one to visualize the transcripts either in staves or in turns (what is called a segment chain list) while listening to the audio. This .html layout also comes out correctly in a plain .txt file, but one may also use other default .txt exports (like the GAT one) or ask for a .txt export suitable for upload and indexation in SpeechIndexer or other concordance tools.

As for .xml exports for the purpose of corpus building and querying, they are also provided by default, but one needs to give the Exmaralda developers a clear idea of what the export should look like, so that a customized export filter can be relatively easily added to the tool. And here resides the last and possibly most important feature of Exmaralda: the kind and helpful availability of the people in charge of software development and dissemination.

As for the limitations to be equally aware of: Exmaralda is complex; it produces stave transcriptions and hence may not let you observe turn patterns as easily as a turn-based transcript would; it is far from the Jeffersonian-style turn transcriptions many researchers still prefer; and it may make it more difficult to print out transcripts for academic purposes (e.g. a paper).

But if we want to go back to orality, which is the true nature of language and language learning, then we should also have the courage to look for new and unconventional ways of sharing results, moving beyond print. We may follow, just to give an example, Hepburn’s (2004) suggestion on how to turn a published article and its transcriptions into an available resource for others.

Some parts of the sound files will be made available over the Internet to provide backup for the transcription suggestions, to allow for future refinements, and to make this article a resource for others wishing to transcribe crying (Hepburn, 2004: 257).

Or we may take a step forward, and envisage sharing transcripts and analyses before even publishing them, as Chris Anderson pioneeringly did on his blog with what has since become a best seller. “The Long Tail” (2006) was first uploaded to the internet in draft version, grew thanks to users’ feedback, and was eventually published as a book that sold so many copies it proved that the availability of an electronic version online does not limit, but rather boosts, the sales of the printed book.

While not claiming to be such an attempt, this post does have the merit of unconventionally sharing the experience I gained while becoming an “expert” in some of these tools. As such, it may be of some use to transcription users and makers, and may hopefully grow with your comments and feedback.

References may be found on the Reference page.
