Spoken Hebrew interview corpus (recorded 2018-2020)

Ref. 2346

Dataset overview

Dataset title

Spoken Hebrew interview corpus (recorded 2018-2020)

Language of the dataset description

English

Dataset description

This data collection consists of interviews in Hebrew that were recorded by Philipp Striedl between summer 2018 and early 2020 for the dissertation "Representations of Variation in Modern Hebrew in Israel: Cognitive Processes of Social and Linguistic Categorization" (Striedl 2022). The process and structure of the data collection are described in detail in Chapter 3 of the dissertation. Further notes on transcription conventions are given in the front matter of the dissertation, on pages xxi and xxii.

Striedl, P. (2022). Representations of variation in Modern Hebrew in Israel: Cognitive processes of social and linguistic categorization [PhD Thesis, LMU Munich]. https://doi.org/10.5282/edoc.29853

Notes on the documentation

File structure:

- metadata.csv is a summary of metadata for recordings and transcripts. Column "speaker" contains each research partner's unique identifier. Columns "residence_district" to "main_language" contain demographic information about research partners, such as age, gender and country of birth. Columns "additional_tasks" to "percent_transcribed" contain information about the state of the recordings and transcripts. Column "additional_tasks" lists additional tasks that were included in the recording: some research partners recounted the frog story and 21 completed GERT (Group elicitation and rating task). Columns "percent_segmented" and "percent_transcribed" indicate approximately what percentage of each recording has been segmented and transcribed.

- Directory "GERT" contains scans of the filled-out forms that were used in the elicitation task and a summary of the data, "GERT_summary.csv". The summary file includes information that was extracted from the filled-out forms, together with an analytical classification. The method that I used to summarise and analyse the GERT data is described in Striedl (2022: 177-194).

- Directory "recordings" contains WAV audio files of the recorded interviews. Each recording event is named after its main research partner (e.g. a22m1l1). The recording events for a22m1l1 and j38m3l2 consist of several files because the recording was interrupted and continued shortly after. Some audio files were merged from multiple audio files belonging to the same recording event. The position where they were merged is marked in the transcripts.

- Directory "transcripts" contains eaf files with ELAN (2022) transcripts of the recorded interviews and a template (template.etf) that was used to create new transcript files in ELAN. Transcripts contain at least two tiers: the interviewer, labelled PS (Philipp Striedl), and the interviewee, labelled with a siglum (e.g. a20f2l2) as explained in Striedl (2022: 143). These two tiers contain transcripts in Hebrew orthography and occasionally in Arabic and English, in accordance with the language spoken by the participants. "לבר " stands for Hebrew 'lo barur' (not clear) and was used in segments that were not intelligible during transcription.

  Additional dependent tiers contain meta annotations. They are marked by affixes on the names of the basic tiers:
  - "ci" stands for "content index"; for example, "a20f2l2-ci" refers to the content of the basic transcription tier a20f2l2. All "ci" tiers are used to mark specific content of the interviews, e.g. $Q1 refers to question one of the guideline and marks the time when it is posed.
  - "DK" stands for "declarative knowledge" and contains codes for specific types of ....
  - "PD" stands for "production data" and is used to mark linguistic structure such as $MS (morpho-syntax).
  - "TR" stands for "translation" and contains English translations of the independent tiers.

  Some early transcripts contain only the speaker tiers and one tier, "notes" or "memo", for analytic codes (in German). All codes and comments are shared because they are part of the analytical process and may be helpful for researchers who are working with similar methods. The use of analytic codes in the transcripts is not consistent and was not intended for any quantitative analysis. Usually, abbreviated codes following a "$" were used to mark interesting segments. Abbreviated codes are often combined with German or English comments. Recurring abbreviated codes are explained in the file codes_explained.pdf.
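The tier layout described above can be inspected without opening ELAN, because eaf files are plain XML. The following is a minimal sketch, not part of the dataset: it assumes the standard ELAN element and attribute names (TIER, TIER_ID, PARENT_REF, ANNOTATION) and assumes that the dependent tiers in this corpus are linked to their basic tier through PARENT_REF. The function name list_tiers is hypothetical.

```python
import sys
import xml.etree.ElementTree as ET


def list_tiers(eaf_source):
    """Return (tier_id, parent_ref, annotation_count) for every tier.

    eaf_source is a path or a binary file object holding one eaf file.
    parent_ref is None for independent tiers (e.g. the speaker tiers)
    and holds the parent tier's name for dependent tiers.
    """
    root = ET.parse(eaf_source).getroot()
    tiers = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        parent_ref = tier.get("PARENT_REF")  # absent on independent tiers
        annotation_count = sum(1 for _ in tier.iter("ANNOTATION"))
        tiers.append((tier_id, parent_ref, annotation_count))
    return tiers


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_tiers.py <path to an eaf file>
    for tier_id, parent_ref, count in list_tiers(sys.argv[1]):
        kind = "independent" if parent_ref is None else "depends on " + parent_ref
        print(f"{tier_id}\t{kind}\t{count} annotations")
```

Since the analytic codes were not applied consistently, the annotation counts are best read as a quick orientation to what a transcript contains rather than as data for quantitative comparison.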

Version number

1.0

Embargo end date

-

Publication date

02.08.2023

Version notes

Minor adjustments to file

Bibliographic citation

Striedl, P. (2023). Spoken Hebrew interview corpus (recorded 2018-2020) (Version 1.0.0) [Data set]. LaRS - Language Repository of Switzerland. https://doi.org/10.48656/yx08-nb97

MD5 hash of the DIP

d69c135e602526e58f2b7a5335190b86

Dataset contents

swissubase_2346_1_1.zip
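
A downloaded copy can be checked against the MD5 hash listed in this record. The sketch below is illustrative only: it assumes that the listed hash was computed over the zip file named above, which this record does not state explicitly, and the function names md5_of_file and matches_record are hypothetical.

```python
import hashlib

# Hash as listed in this record; assumed to refer to the downloadable zip.
EXPECTED_MD5 = "d69c135e602526e58f2b7a5335190b86"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling matches_record("swissubase_2346_1_1.zip") on the downloaded file returns True only if the digests agree. A mismatch may mean a corrupted download, or simply that the hash refers to a different packaging of the data.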