Code-switching and types of multilingual communities

Type of work: Diploma thesis (final qualifying work)
Subject: Information support, programming
Language: English
File format: MS Word, 35.99 KB
Published: 2015-12-30


Abstract

Code-switching has recently become a very popular research topic in linguistics. However, because there is no tool for analyzing the phenomenon on large amounts of data, many questions remain unanswered. This work focuses on creating a set of rules for the automatic annotation of texts produced by multilingual speakers, in order to develop a prototype corpus that will allow more precise and extensive analysis of data containing code-switching. The project consists of a review of existing work on code-switching, a definition of the main features of the markup, and the building and annotation of a code-switching corpus based on data collected for the corpus of the Udmurt language.

1. Introduction

1.1 Background

There are around 6500 languages on the planet. Over half of the Earth's population is at least bilingual, and many people are trilingual or multilingual. There are a number of reasons for this, many of them geographical: people in one settlement learn the languages of the closest villages in order to communicate, parents often speak different languages to their children, many children speak one language at home and another at school, and so on. It is no exaggeration to claim that bilingualism is more of a norm than an exception to the rule (Golovko 2001).

Consequently, there are many communities where people speak the same few languages. These conditions cause many changes in grammar and vocabulary due to language contact, but they also lead speakers to mix those languages in their speech, intentionally and unintentionally, for instance by starting a sentence in one language and finishing it in another. Such code-mixing or code-switching is a very common phenomenon and has been widely studied on the basis of many different languages. This paper uses the term code-mixing rather than code-switching, for the reasons suggested in (Muysken 2000) and discussed in the next part of this work.

Just as there are various situations in which people tend to learn more than one language, there are several conditions under which they do so, including bilingual acquisition with differing areas of application, second language learning (possibly leading to incipient bilingualism) at any age, and balanced bilingualism (a child learns both languages to an equal level, often because the parents speak different languages).

In addition to favorable conditions in which people mix languages out of surplus, there are conditions that actually force people to insert words from another language. This can be because one of the languages is on the verge of extinction and lacks many words and expressions, or because of medical conditions such as aphasia or dementia.

Although none of these conditions is new, code-mixing is a relatively new area of research. In the 1960s Meri Lehtinen and Michael Clyne, working with a small Finnish/English corpus, made the first attempt to find out whether there are patterns in the way the speaker chooses the language; they also assumed that a switch can only occur when there are similarities in the surface grammar of the two languages and that only words belonging to open classes can switch (Lehtinen 1966; Clyne 1967). Up until the 1970s alternations between languages in the course of a discourse were mainly considered linguistic rubbish and were dismissed as random (Labov 1972; Lance 1975; Weinreich 1953/1968), although today code-mixing is universally recognized as grammatically constrained. A few early works describe various restrictions on switches within particular grammatical constructions (Gumperz 1976/1982; Timm 1975), but at the end of the 1970s a few papers (Pfaff 1975, 1976; Poplack 1978/1981) revealed some regular code-mixing constraints, which led to more definite rules and limitations regarding code-mixing in different language pairs. Soon this topic was taken up by other linguists.

Despite the newness of this area of research, quite a lot of work has already been done. Code-mixing is traditionally studied using one of three approaches: psycholinguistic (Grosjean 1982; Kolers 1966; Lipski 1978), sociolinguistic (Gumperz 1982; Finlayson, Calteaux, Myers-Scotton 1998; Heller 1992) or linguistic (Poplack 1980; Myers-Scotton 1993; Muysken 2000; Sridhar, Sridhar 1980). In my research I will mostly concentrate on the linguistic approach, although hopefully the results will be helpful for every aspect of studying code-mixing.

The linguistic approach accounts for structural questions, such as research in the field of morphology and syntax. The main goal is to find out whether code-mixing obeys any rules. For instance, (Poplack 1980, 1981) and (Sankoff, Poplack 1981) deduce two constraints on code-mixing. One states that the morphology of different languages cannot be mixed within the boundaries of one word. The other suggests that the syntactic structures of the two languages have to be equivalent for a switch to occur. However, although they were first assumed to be universal, neither of these constraints turned out to be.

It may be paradoxical, but despite the amount of research conducted, very little data has been collected for such purposes, so most papers can only describe the situation for two (or more) particular languages, and there is no way to analyze the bigger picture. There are a few rather big corpora of examples for various languages (Spanish/English (Poplack 1980), Italian/French (DiSciullo, Muysken, Singh 1986), Moroccan Arabic/French (Bentahila, Davies 1983), etc.). However, the corpus method is usually dismissed by most people doing research in code-mixing (Milroy, Muysken 1994), mostly because of the cost of collecting data. Therefore, for lack of tools for the automatic processing of texts with code-mixing, most of the research is conducted manually; with no way to analyze the phenomenon on a large scale, many questions remain unanswered.

Another objective obstacle is, obviously, that code-mixing mostly occurs in spontaneous speech, while the cases that appear in fiction are based entirely on the author's intuition; the generally accepted idea is therefore that corpora have to consist of spoken conversations. Furthermore, most code-mixing hypotheses cannot be confirmed or rejected by informants, as they cannot be checked with intuition (Muysken 2000).

Nevertheless, the Internet allows collecting incredible amounts of data such as public blogs, Twitter, etc. (Dorleijn, Nortier 2009). Such texts are much closer to spoken conversation than traditional literature and often include a large number of code-mixing examples, so they can be used for research. It is important to note that I in no way claim written blogs to be equal to spoken conversation, but I believe that they are more than worthy data that can be analyzed and can potentially be a huge step towards understanding the rules of code-mixing in general.

In this paper I have tried to survey most of the major studies describing various rules and constraints on code-mixing. I went over the types of code-switching, in particular insertion, alternation and congruent lexicalization, and discussed what should fall under each type. In order to create a unified system for analyzing code-mixing phenomena I have worked out basic principles for annotation, which, with very few modifications, can be applied to any language pair (or to more than a pair). I have considered which conditions might influence the choice of code-mixing patterns and pointed out the problems that might come up when studying certain phenomena in different languages. Based on these principles I have created the first version of an Udmurt/Russian online annotated corpus. The corpus consists of Internet blogs and has both morphological and code-mixing annotation. Based on the obtained data I have been able to determine the main strategies of Udmurt -> Russian and some Russian -> Udmurt code-mixing, and I have pointed out some problems that occur in automatic code-mixing annotation when it is applied to languages that have been in contact for a long time. I have also checked whether any of the known code-mixing constraints were violated.

In addition, I have developed a plan for possible future research based on this new annotation. It could significantly extend the potential of the typological approach in code-mixing and maybe even become a step towards figuring out how to induce, manipulate, and replicate natural code-mixing (Gullberg, Indefrey, Muysken 2009).

1.2 What is code-mixing?

The term code-switching came from the physical sciences (Fano 1950) and then shifted to political anthropology (Gal 1987, 1995); along the way the meaning of the notion has changed and multiplied. In research on bilingualism and bilingual behavior in particular, however, it first appeared as switching code. This was the term first used for what we call code-switching or code-mixing today. The topic was taken up in structural phonology, information theory, and research on bilingualism. In 1952 Jakobson began its synthesis (Jakobson, Fant and Halle 1952). His work is based on (Fano 1950), a paper in information theory, and on (Fries & Pike 1949), a paper on phonemic systems, which suggests that two or more phonemic systems may coexist in the speech of a monolingual (1949: 29).

At about the same time, Hoijer (1948) introduced the concepts of phonemic alteration (parallel to what today is called borrowing) and phonemic alternation (parallel to code-mixing).

(Jakobson, Fant and Halle 1952) and later (Jakobson 1961) describe the notion of switching code in terms of the decoding that a bilingual speaker must do to understand another person's code or to produce their own. As an example they present the situation of the Russian aristocracy of the 19th century, who switched between Russian and French constantly, sometimes within a single sentence (Jakobson, Fant and Halle 1952: 603-604).

The work also states that "two styles of the same language may have divergent codes and be deliberately interlinked within one utterance or even one sentence" (Jakobson, Fant and Halle 1952: 604). Interestingly, it formulates that every language is not a code, but that it has a code (Alvarez-Cáccamo 1998). Thus, code-switching is conceptualized as the alternation not only of languages, but also of dialects, styles, prosodic registers, paralinguistic cues, etc., subjects later discussed in (Gumperz 1982), (Gumperz 1992) and (Auer 1992).

(Muysken 2000) proposes the term code-mixing for the general notion of alternation of various codes in a language and suggests reserving code-switching for the rapid succession of several languages in a single speech event. He does, however, use switch and switching when referring to a particular co-occurrence of elements from different languages in a sentence. I am going to adopt his terminology because of its additional accuracy and transparency.

The question of choosing the correct term for this phenomenon is important for understanding the boundaries of what is being discussed. However, even after deciding on the word there are still very different opinions on what code-mixing represents. There are suggestions that code-mixed fragments of speech should be considered a new single code, a sort of new language. This is not unreasonable, because the speakers do not rely on any grammatical distinctions between languages as something significant (Gardner-Chloros 1991). However, the fact that switching is possible only under particular conditions means that the grammars should be taken into account. I can also suppose that for particular types of languages these conditions are very similar. Still, I have to agree with Gardner-Chloros that we are dealing with a single speech flow. Switches may occur many times in one sentence, leaving us to wonder whether a bilingual person has a certain bilingual system in their mind which allows them to switch between languages so easily.

There are a few works that support this hypothesis. (Swigart 1992) describes the bilingual situation in Dakar, Senegal, where people speak Wolof and French and, interestingly, these languages are almost never used separately:

(1)...xam nga weeru benn jour, quelques minutes lay def, quelques minutes rekk et puis c'est petit, un tout petit kii la! Boo gaawul, doo ko men a gis.

...you know the first day's moon, it's only there for a couple of minutes, just a couple of minutes and then it's small, it's a really small thing! If you are not quick, you won't be able to see it.

(Swigart 1992: 89-90)

The examples and the translations are taken from (Swigart 1992); the italics are mine.

(2) Mēn naa lakk olof sans lakk 'faranse'.

I can speak Wolof without speaking French.

The last example shows a person trying to prove that he can speak pure Wolof, but still using the French sans 'without'. The irony of the example shows how sometimes a speaker cannot avoid code-mixing even when he aims to. Obviously, there are fewer switches, but it seems to be hard to avoid them altogether. (Golovko 2001) states that not every code switch is determined by the listener, especially in a bilingual community. So what Golovko suggests is either to consider code-mixing from a point of view where orientation towards the listener is not obligatory, or to stop viewing such phenomena as code-switching at all. To solve this he proposes to introduce an opposition of motivated vs. unmotivated mixing.

A few papers (Backus 1993: 233; Sarhimaa 1999: 237) support the same approach. They claim that bilingual communities are characterized by fluid code-mixing, which is due to the unflagged (unmotivated) code-mixing.

Another argument for the existence of a single mixed code was offered by Yael Maschler, who worked on the Hebrew-English language alternations of one bilingual speaker (Maschler 1998). The article demonstrates that some elements of the discourse make it a mixed code rather than code-switching, because they prove to have exclusive functions not typical of either language. This holds to varying degrees, depending on the context, and not all of the alternations were of this kind, but any deviation proves the existence of some level of transformation. The study, however, only says something about the speech of one particular person; it does not say whether this transition is inherent to Hebrew-English mixing in general. I believe such research has not been conducted yet, although a comparison with a transitioned code from a different language pair would definitely be a very strong argument for a more global distinction between code-switching and mixed code.

1.3 What is there to study?

Code-mixing, studied as a sociolinguistic phenomenon, is naturally influenced by many extra-linguistic factors. Therefore, apart from the strict description of patterns, there are questions of whether anything other than the restrictions of the languages is involved in the choice of these mixing patterns, and why certain communities show one pattern rather than another.

There are two major ways to work on code-mixing. We can use a descriptive method: work on particular mixing strategies and look at the constraints on when they can occur in a certain language pair. The other approach is explanatory and requires an attempt to account for why and where mixing is possible. Although I would like to strike a balance between the two, I will mostly use the first approach, with the goal of making it a path towards a constructive theory explaining different features of code-mixing.

Studies following the linguistic approach traditionally focus on whether there are any rules governing how switching between languages happens and, if there are, whether any of them are universal.

2. Code-mixing

2.1 Classification

There are many ways to classify the types of code-mixing. One of the first classifications was proposed in (Clyne 1967), where he divided code-mixing into three forms based on the notion of a trigger-word that forces the speaker to switch to another language unintentionally: following (the switch occurs after the trigger-word), preceding (the switch takes place before the trigger-word) and combinative (the switch is realized between two trigger-words) (Clyne 1980; 2003).

However, a more common distinction is the one by (van Hout, Muysken 1994). They proposed partitioning code-switches into three other categories: insertion (switches that are preceded and followed by elements of another language), alternation (switches from one language into another for more than one word) and congruent lexicalization (a few words in different languages that do not form one or multiple constituents).

Another classification can be found in (Poplack 2000). She offers four types of code-mixing. The first type is single-word insertions, which she classifies as nonce-borrowings. The second type includes bigger, established constituents, such as exclamations, particular phrases and idioms. The third type describes longer alternations of more than one word inside one sentence, and the fourth type characterizes switches between whole sentences.

There are more approaches to describing and working with code-mixing, including the generative approach (MacSwan 1999a, 1999b) and the classification of types of mixes in (Myers-Scotton 1988; 1989).

We are only going to analyze code-switching within one sentence, and therefore we are going to look at the classification in (van Hout, Muysken 1994).

2.2 Insertion

2.2.1 Definition

Let us first consider insertion. The schematic way to describe it would be the following:

In this and subsequent pictures A and B are simply different languages. Thus the scheme here shows a clause in language A, and one or more constituents in language B inserted inside this clause.

Thus, the reasonable step would be to admit that any switch to another language and back should be considered insertion. However, this is not exactly the case. First of all, not every constituent gets involved in switches, and if it does, then there is the most controversial question of whether it is a code-mixing insertion, a borrowing, or what is described by another common term, nonce-borrowing.

According to the equivalence constraint, of which I will speak later, code-mixing is only allowed within the grammatical structures that exist in both languages. Therefore, almost any code-mixing involving noun phrases is of the insertional type. NPs are very well-defined constituents and mostly syntactically inert, and because of that they are easily insertable. (Muysken 2009) suggests the following insertional types in terms of nominal constituents:

It has been deduced that insertions most often involve single constituents; they exhibit an A B A structure, so that the fragments preceding and following the insertion are related grammatically:

These are also mostly content words rather than function words (van Hout & Muysken 1994). Inserted items are mostly nouns, adjectives and verbs.

Thus, insertions are usually single, nested, morphologically integrated constituents consisting of content words, and the grammar of the base language determines the overall structure of the sentence.
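The A B A structure described above lends itself to a simple automatic check. The following is only a minimal sketch, assuming that every token has already been tagged with a language label and that the matrix language is known; the function names and the `max_len` threshold for a "short" constituent are hypothetical and are not part of the annotation scheme described in this work.

```python
from itertools import groupby


def language_runs(tags):
    """Collapse token-level language tags into (language, length) runs."""
    return [(lang, len(list(group))) for lang, group in groupby(tags)]


def find_insertions(tags, matrix, max_len=2):
    """Return (start, end) token spans of embedded-language runs nested as A B A.

    `matrix` is the assumed matrix language (A); `max_len` is a hypothetical
    cut-off for how long a run may be and still count as an insertion.
    """
    runs = language_runs(tags)
    spans = []
    position = 0
    for i, (lang, length) in enumerate(runs):
        nested = 0 < i < len(runs) - 1
        if nested and lang != matrix and length <= max_len:
            # Both neighbours must be in the matrix language.
            if runs[i - 1][0] == matrix and runs[i + 1][0] == matrix:
                spans.append((position, position + length))
        position += length
    return spans


tags = ["A", "A", "B", "A", "A", "B", "B", "B"]
print(find_insertions(tags, matrix="A"))  # [(2, 3)]
```

In this toy sequence only the single B token surrounded by A counts as an insertion; the final stretch of B is not followed by a return to A, so on the definitions above it would be a candidate for alternation instead.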

2.2.2 Code-mixing vs borrowing

One of the most controversial topics in code-mixing, especially while describing insertion, is what we should consider switched elements and what is rather just a borrowed word or an expression.

(Poplack 1980) suggests that there are nonce borrowings, where a word is borrowed for one occasion only; code-switching, where more than one word is switched; and established borrowings, which can be found in a dictionary.

Here are the features distinguishing between borrowing and code-switching according to (Muysken 2000):

As we can see, the main criteria for distinguishing between code-mixing and borrowing reflect the level of adaptation of the word to the system of the main language, which includes phonetic, morphological and syntactic adaptation. They are, however, not absolute; here is the division as it can be found in (Poplack 1980a) in regard to Spanish/English code-mixing:

When only partial adaptation takes place, we can talk about the borrowability of the inserted words, or the degree to which a word becomes part of the matrix language. This allows the line between loan words and code-mixing to be drawn in very different places. Therefore, as one of this work's main goals is the unification of code-mixing description, I will try to establish this line so as to reflect the difference as well as I can while using automatic processing.

Thus, embedding an element from another language into a clause is code-mixing, but adding it to the lexicon is borrowing (Muysken 2000). However, if you consider this matter from the speaker's perspective, the distinction might be different. As speakers operate with the lexicons and grammars of both languages, they might view the lexicons as two intersecting subsets, so rather than as borrowing items from that intersection, the process can be viewed as lexical sharing.

Construction Grammar, however, suggests looking at the borrowing versus code-mixing topic within the dimensions of listedness (the degree to which the word is adapted within the language) and lexicality (supra-lexical/sublexical) (Goldberg 1995). Here is how it would be represented this way:

                 non-listed                 listed
supra-lexical    spontaneous code-mixing    conventionalized code-mixing
sublexical       nonce-loans                established loans

Although code-mixing is mostly spontaneous, there are certain patterns of mixing that are more common in one community than in another, even if they speak the same languages (Poplack & Sankoff 1988). Such code-mixing is classified as conventionalized. Established loans are naturally the ones that have long taken a firm position in the language. Nonce loans, a term introduced in (Haugen 1950), are elements that are borrowed spontaneously and do not have any status in the receiving speech community.

As we have discussed before, nouns and noun phrases are most easily borrowed, and these borrowed noun phrases can be complex (Sankoff, Poplack, and Vanniarajan 1990: 80). Although they can still be lexicalized, such combinations are not borrowed as easily into the language (we have already seen the NPs in the hierarchy). But it is not only the length of the borrowed element that influences borrowability; for instance, plural nouns tend to fall under code-mixing, and N-insertions in particular, but rarely under nonce borrowing.

Based on Haugen's theory of loan words, (Muysken 2009) puts forward the following hierarchy of borrowability:

nouns > adjectives > verbs > prepositions > coordinating conjunctions > quantifiers > determiners > free pronouns > clitic pronouns > subordinating conjunctions

This hierarchy is derived through statistics only, and no explanation is currently available. Moreover, it seems that this hierarchy is not universal for every language pair.

What we can say for certain, however, is that code-mixing involving agglutinative languages is more predisposed to borrowing, as these languages are defined by the absence of lexical selection by affixes: there are no conjugation classes, no special morphophonemic rules, etc. As the affixes are non-selective, they always fall under equivalence, because a lexical base in one language is equivalent to one in another language.

For the same reasons fusional languages are highly resistant to borrowing (Budzhak-Jones 1998; Budzhak-Jones and Poplack 1997). There are also very extreme noun/verb asymmetries in borrowability: nouns are often borrowed uninflected, but verbs are not (Nortier & Schatz 1992). The asymmetry exists in agglutinative languages as well, but many verbs are still borrowed.

Most of all, it is important to point out that insertion is a different process in different languages. This is so primarily because there is no agreement on what should be considered insertion and what is borrowing, but also because it depends on the types of the languages, as we have already discussed for the fusional and agglutinative distinction. At this moment we cannot simply choose a universal way to describe insertion in different languages; however, if we want to create a corpus we will have to accept some generalization, and naturally it is going to lean towards simplifying automatic processing.
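For annotation purposes the borrowability hierarchy can be stored as an ordered list, so that two word classes can be compared by position. This is a minimal sketch: the first position ("nouns") is inferred from the surrounding discussion, the names `BORROWABILITY_HIERARCHY`, `rank` and `more_borrowable` are hypothetical, and, since the hierarchy is statistical and apparently not universal, the comparison is a rough prior rather than a rule.

```python
# Word classes ordered from most to least borrowable (after Muysken 2009).
# "nouns" in first position is an assumption based on the surrounding text.
BORROWABILITY_HIERARCHY = [
    "nouns",
    "adjectives",
    "verbs",
    "prepositions",
    "coordinating conjunctions",
    "quantifiers",
    "determiners",
    "free pronouns",
    "clitic pronouns",
    "subordinating conjunctions",
]


def rank(category):
    """Position in the hierarchy; 0 is the most borrowable class."""
    return BORROWABILITY_HIERARCHY.index(category)


def more_borrowable(first, second):
    """True if `first` is placed higher in the hierarchy than `second`."""
    return rank(first) < rank(second)


print(more_borrowable("adjectives", "determiners"))  # True
```

A corpus tool could use such a ranking, for example, to flag a single embedded determiner as a less likely loan than a single embedded noun.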

2.2.3 Determining matrix language

(Haugen 1956:39) states the following:

Any item that occurs in speech must be a part of some language if it is to convey any meaning to the hearer...The real question is whether a given stretch of speech is to be assigned to one language or the other.

Therefore, when studying insertion, it is common to divide the languages of the discourse into a matrix language and an embedded language (or languages). It is important to understand which elements are inserted into which language, and which language the person is speaking at that moment, so that we can first of all make sure it is an insertional type of code-mixing, and also make a more precise analysis of the occurring switch.

There are five ways that are used to determine the base language. First of all, we can just regard it as the language of the conversation (Berk-Seligson 1956: 323). This idea seems intuitively plausible; however, when the languages are too mixed together it can be hard to determine, especially automatically, and sometimes even the speakers themselves cannot say which language was the main one in their speech.

The second approach is to count the morphemes of the words that are uttered (Myers-Scotton 1993b: 68). This approach is more statistical: it assumes that the matrix language contributes the most words and morphemes. This model, however, does not take into account that some languages naturally have more morphemes in general. If the language pair consists of a polysynthetic or agglutinative language and an isolating one, the number of morphemes will mean almost nothing.

A more psycholinguistic approach is to see the matrix language as the language in which the speaker is more proficient. Proficiency is not a very reliable criterion, though. In cases of balanced bilingualism it can be hard to determine even for the speakers themselves. Moreover, different situations might provoke the speaker to use one language or another, depending for instance on who they are talking to.

Another approach is from left to right (Doron 1983). Despite its disturbing simplicity it might be a good way of determining the matrix language. The trick is that determining the base language is only needed when analyzing insertion, and it is impossible to have an inserted word as the first one in the sentence: if the rest of the sentence is in another language then it is alternation, and if it is a mix of two languages then it is congruent lexicalization. This, however, is only a good approach if we look at separate sentences and examples and not at the conversation as a whole.

A much more elaborate approach was introduced in (Milroy, Muysken 1994). They suggested using a structurally oriented model, where some element or set of elements determines the matrix language.

There is also another approach, which is based on a government model (DiSciullo, Muysken, and Singh 1986). It suggests that there is no single matrix language for a particular clause, but that every governing element in the sentence establishes a matrix structure. From this it follows that unless the chain of government is broken, the language of the tree is determined by its highest element, which is usually a finite verb or, in the case of a subordinate clause, the complementizer (Klavans 1985; Treffers-Daller 1994).

After choosing a strategy, or some compromise between them, there is still another issue, which is determining which language a word belongs to. If the languages are similar or have been in contact for a long time, many words may be very similar; often the morphology can point to one language or the other, but if the languages are morphophonemically similar then it can be hard or even impossible to assign the word to a particular language.
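Two of the strategies discussed in this section, morpheme counting and the left-to-right rule, are simple enough to sketch in code. The sketch below assumes that each token is already annotated with a language label and a morpheme count; the token placeholders, the counts and the function names are invented for illustration and do not come from the corpus.

```python
from collections import Counter


def matrix_by_morpheme_count(tokens):
    """Pick the language contributing the most morphemes.

    `tokens` is a list of (word, language, morpheme_count) triples.
    """
    totals = Counter()
    for _word, lang, morphemes in tokens:
        totals[lang] += morphemes
    return totals.most_common(1)[0][0]


def matrix_left_to_right(tokens):
    """Pick the language of the first token of the sentence."""
    return tokens[0][1]


# Hypothetical sentence: placeholder words, invented morpheme counts.
tokens = [
    ("w1", "rus", 1),
    ("w2", "udm", 3),
    ("w3", "udm", 2),
    ("w4", "rus", 1),
]

print(matrix_by_morpheme_count(tokens))  # udm
print(matrix_left_to_right(tokens))      # rus
```

The two heuristics disagree on this toy input, which illustrates the problem raised above: an agglutinative language can win the morpheme count simply because its words carry more affixes, regardless of which language frames the clause.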

3. Congruent lexicalization

Similar to insertion, there is another type of code-mixing: congruent lexicalization (van Hout and Muysken 1995). Its structure can be visualized as such:

Unlike insertion, congruent lexicalization involves several mixed-in constituents, sometimes so many that it is hard to determine the main language of the discourse or to determine which language the syntactic structure belongs to, as the grammatical relations of the two languages interlace too tightly. This code-mixing pattern is common for second-generation immigrants and for bilinguals speaking closely related languages (Vakhtin, Golovko 2004: 28). The reason is that congruent lexicalization results from frequent trigger words, so an overabundance of homophonous words (especially in related languages) can cause code-mixing. But even if there is no lexical correspondence, categorial and linear equivalence is also a cause of congruent lexicalization. This is easily understandable, as this type of code-mixing is possible due to grammatical convergence. The vocabulary comes from two languages and the grammatical structure belongs to both at the same time. Not the whole grammar has to be shared by both languages; often there is just an alignment of the major constituents, but not of all the internal structure of these constituents.

In many bilingual communities some structural convergence is commonplace, which raises many issues of language contact and language change and of whether there is a causal link. The controversy brings out the question of whether, if there is some connection, code-mixing is responsible for the convergence or whether it is the other way around, and whether this convergence always means reduction and simplification of both languages.

(Muysken 2009) gives a hierarchy that he compiled to represent the degree to which congruent lexicalization occurs in various communities with respect to Dutch:

[...]/Dutch
[...]/Dutch
[...] in Australia/English
[...] Malay/Dutch
[...]/Dutch
[...] Arabic/Dutch
[...]/Dutch

As we can see, the pairs that are higher in the hierarchy can be regarded as intralinguistic variation.

4. Alternation

Another very common strategy of code-mixing is alternation. It can be represented by this scheme:

Although two languages exist in one clause, they remain separate. A good example of this type of code-mixing is actually the title of Poplack's article (Poplack 1980):

Sometimes I'll start a sentence in Spanish y termino en español

Sometimes I'll start a sentence in Spanish and finish it in Spanish (the mistake is made by one of Poplack's informants)

Unlike insertions, where most embedded elements are nouns and adjectives, alternation is often provoked by particles and adverbs (Muysken 2009), and there are syntactic differences to be considered. Alternations are more likely to appear on the boundary of a major clause.

(Treffers-Daller 1994) contains a corpus of French/Dutch code-mixing in Brussels, which is characterized by a high number of alternations. There are two important points that can be made on the basis of her data: alternation only occurs where the word order is the same, and it usually happens on the border of two major clauses (Muysken 2000). Based on her corpus, and using a probabilistic approach, she also proposes a hierarchy of the probability of various constituents being part of an alternation:

NPs/PPs > dislocated NPs/PPs > adverbial PPs/NPs > before subordinate clauses > predicative NPs/APs/possessive PPs > subject or object NPs and clauses > indirect questions

Alternation is tightly tied to the syntax. Switches of any type can occur either in the center of the clause or on the periphery. The switch usually involves a left- or right-dislocated element or can be found at the beginning of the second of two conjoined clauses.

Although linguists discussing alternation mostly mean sentence-internal switching, mixing between utterances is also entirely possible, and marking it as alternation is theoretically substantiated: between clauses the alternation occurs at the boundary, and once the language is switched it remains the same:

(4) Adios, amigos! See you tomorrow.

Goodbye, friends! See you tomorrow.

Although alternation usually takes place at a boundary, it is still possible for it to be found in a connected structure under equivalence, as with congruent lexicalization (Naït M'Barek & Sankoff 1988), (Poplack & Meechan 1995).

5. Constraints

5.1 Equivalence Constraint

As I have mentioned before, there is an equivalence constraint (S.Poplack, D.Sankoff 1981), the existence of which is supported by many linguists. There are at least two ways to look at it. Some believe that code-mixing should not violate the syntactic structures of either language (also mentioned as the switch-alpha constraint in (Choi 1999)); others believe that one language assimilates into another (the dual structure principle in (S.N.Sridhar, K.K.Sridhar 1980); the matrix language model in (C.Myers Scotton 1989); the matrix language principle in (Kamwangamalu 1998)). The latter view follows the theory that the adopting language sets the aspect, tense, agreement, etc. (Bhatt R.M., 1997). Whatever the mechanism is, it is clear that the speaker tries to avoid any grammatical conflicts when producing the utterances.

Consequently, this constraint states that a switch cannot occur within a constituent generated by a rule from one language if this rule does not exist in the other, and therefore a switch cannot violate the syntactic rules of either language.

The constraint may be demonstrated with the classical examples in (5) and (6), which were constructed by Gingras (1974) and then tested on a group of Chicano bilinguals for acceptability.

(5) El MAN que CAME ayer WANTS JOHN comprar A CAR nuevo.

Spanish: El hombre que vino ayer quiere que John compre un coche nuevo.

English: 'The man who came yesterday wants John to buy a new car.'

(6) Tell Larry QUE SE CALLE LA BOCA

Spanish: Dile a Larry que se calle la boca.

English: 'Tell Larry to shut his mouth'.

Both sentences have very similar structures: each contains a verb phrase and a verb phrase complement. In English both verbs require the infinitive complementizer rule, whereas in Spanish the same construction comes with a subjunctive complementizer. Although (5) switches languages at almost every word and (6) has a single switch between two constituents, their biggest difference is in regard to linear equivalence (Gingras 1974). By using the infinitive complementizer, the first sentence violates the constraint, as this is not a Spanish construction. The first half of the sentence is composed of constituents that do not go against any rules; English and Spanish map onto each other perfectly there:

El MAN que CAME ayer WANTS…

The man who came yesterday wants…

El hombre que vino ayer quiere…

Up to this point the switch may occur anywhere, but not further. Thus, all of Gingras' informants found the full sentence unacceptable.

Moreover, A CAR nuevo does not follow the English adjective-noun word order, and although some Spanish adjectives may precede the noun, nuevo is not one of them. When structures are not equivalent in the two languages, the constituents tend to be uttered in one of the languages, as in the second example. This sentence was found acceptable by 94% of Gingras' informants.

The equivalence constraint has been verified as a tendency in many language pairs: Spanish/English (Poplack 1980), Finnish/English (Poplack et al. 1987), French/Arabic (Naït M'Barek & Sankoff 1988), English/Tamil (Sankoff et al. 1990), Wolof/French and Fongbe/French (Poplack & Meechan 1995), Ukrainian/English (Budzhak-Jones 1995), French/English (Turpin 1998) and possibly more.

Nevertheless, (Di Sciullo, Muysken and Singh 1986) disagree with this constraint. They argue that it does not include any notions of structural or hierarchical relations (on which most grammatical principles are built) and relies only on linear sequence.

Thus (Di Sciullo, Muysken and Singh 1986) suggest an alternative description of the equivalence constraint that involves the notion of government. Let us consider verbs, adpositions, etc. as governing elements and noun phrases as governed elements. Each category, including the governed elements (e.g. noun phrases), must be perceived by the speakers as equivalent in the two languages. Under government theory, linear equivalence is simply a subcase of categorial equivalence, since, for instance, a verb governing rightward is not exactly equivalent to one governing leftward. According to this government constraint, switching is possible only between elements that are not related by government (for example, in a PP the preposition governs the NP, and in a VP the verb governs the object). They claim that this constraint is prioritized over every other. It has been tested on French/English/Italian code-mixing, as well as on Hindi/English.

However, it has also been argued against by (Klavans 1985), as it predicts that simple and certainly frequently occurring examples are impossible, such as switching between V and object NP:

(7) Los hombres comieron the sandwiches

The men ate (Spanish) the sandwiches (English).

At the same time it allows very rare examples such as:

(8) La plupart des canadiens scrivono c.

The majority of (the) Canadians (French) write (Italian) c.

In their defense, (Di Sciullo, Muysken and Singh 1986) state, however, that although they claim the government constraint to be universal, it can be supplemented with additional constraints.

5.2 Free-Morpheme Constraint

Another constraint that seems to be much less frequently violated is the free-morpheme constraint, which basically precludes such formations:

(Clyne 1980) cites a few instances that show free-morpheme violation, even though Clyne does state that these examples are very rare in their corpus.

(9) That's what Papschi mein -s to say.

That's what Papschi means to say.

The name Papschi is pronounced with German morphology. It can be a trigger, but either way this utterance contains two switches: from English to German and then from German back to English (possibly because the speaker noticed the previous switch). The second switch occurs within the word.

A similar situation can be found in the following example (10), with another morpheme:

(10) in meine Mutter -s car.

In my mother's car.

A very similar switch at the possessive morpheme can be seen in this English/Dutch example (13).

(13) naar mijn vriendins place

At my girlfriend's place

An even more interesting switch involves just a single morpheme, but each time it occurs:

(14) Es waren hundert-s und hundert-s of Leute.

It was hundreds and hundreds of people

Clyne gives another very unusual example:

(15) Dan somstimes go voorn hour nog in bed.

Then sometimes go for an hour to bed.

Note that Dutch som(s) already means 'sometimes'. However, the switch is probably triggered by this word, which shows that the ambivalence of a word can trigger a switch not only at a separate word, but within a word as well.

The existence of this constraint is widely discussed, some argue for it (Poplack 1980), some against (Clyne 1987; 2003),(Berruto 2005), but it seems that even if it is violated at times it is not a norm, and as it happens in the speech of particular speakers we cannot claim that it is typical for any particular community or language pair.

5.3 Closed-Class Constraint

In addition, there is a constraint proposed in (Joshi 1983) and soon taken up by (Doron 1983), which limits the switching of closed-class elements, such as quantifiers, tense morphemes, complementizers, pronouns, prepositions, determiners and others, if they exist in the language. The constraint was worked out on the basis of Marathi/English code-mixing, but to some extent it has been confirmed on many other language pairs.

5.4 Language-Specific Constraints

In addition to constraints that supposedly carry over to every language pair, there are also a few language-specific constraints. For example, the widely studied Spanish/English data suggest that there is a constraint that prohibits switching between a noun and a following modifying adjective (Woolford 1983).

(16) *the casa big

the house big

However, it seems that the switch of a clitic pronoun can sometimes occur:

(17) Yo it compré.

I it bought.

Woolford also suggests a restriction on switches that include verbs with empty subjects and auxiliaries with some negatives.

(18) *Was training para pelear. ...to fight

(19) *I am no terca. ...stubborn

Woolford supposes that this constraint exists due to a language-specific transformation rather than language-specific phrase-structure rules, which is another reason why corpora of more language pairs are needed, as Spanish and English are both SVO. According to (Klavans 1985), conflicts in code-mixing cannot be explained through constraints when it comes to differently structured language pairs.

However, the work on Hindi/English (SOV vs SVO) code-mixing (Di Sciullo, Muysken and Singh 1986) suggests that such a pair might be constrained due to the more Hindlish structure of the discourse (Hindi with lexical transfer from English), something that we have already discussed in this paper in regard to English/Ukrainian code-mixing. They also state that language-specific constraints are complementary to the general constraints and do not override them.

For Hindi/English they observe the following constraints:

-switching occurs differently between subject and verb and between verb and object; the second is also much rarer

-complements of a preposition must be in the same language as the preposition, as in sonata for two violins

-phrases inside a phrase structure tree must be in the same language

More constraints can be found in (Pfaff 1976). She, as well as (Wentz and McClure 1977) and (Timm 1975), states that in Spanish/English code-mixing a clitic pronoun object must always be in the same language as the governing verb. She notes that switches between Determiner + Noun are very rare (they are found ungrammatical in (Wentz and McClure 1977)), as are full clause switches, which are found to be frequent in (Gumperz 1976). She also disagrees with Gumperz, who claims that conjunctions are always in the same language as the conjoined sentence. She also found switches of prepositions to be impossible and switches of full pronoun phrases to be very rare.

Pfaff also formulates a semantic constraint (no one has supported it yet, but no one seems to have objected to it either), although semantic issues are often discussed in regard to code-mixing. She claims that PPs can be switched if they are temporal or figurative, but not locative.

5.5 Summary

As we can see, there are many controversial topics among scholars. These include the government and free-morpheme constraints, whether code-mixing is a surface or a deep structure phenomenon, and whether there is a mixed grammar in some communities or whether it is always switching between separate grammars. In regard to the non-language-specific constraints, the evidence that we have looked at suggests that even if there are cases when they are violated, the general tendency is to follow them. Our current aim is to create a prototype of a resource that will help to determine the conditions under which the constraints do not hold, as well as to distinguish more accurately between patterns specific to a language pair and patterns specific to a speaker. On the issue of mixed grammar vs. two separate grammars, I want to point out that most bilingual environments (either a big community or just a family, for instance) presume convergence of languages through contact (even if this contact exists in only one person's mind); therefore the mixed grammar analysis is more plausible. This argument is strengthened by triggering, syntactic convergence, and syntactic transference. I have looked over all the constraints and discussed the arguments for and against them. After analyzing the data I cannot take any one of them as universal, but I can certainly accept the tendency that they represent. I will let the reader decide whether to accept or reject each of them, but I hope that the principles I have developed for the annotation of multilingual texts and the model of the corpus that has been built will be of help in testing any of the reader's theories regarding these or any other constraints.

The difficulties in the discussion of code-mixing constraints are, however, due to the unclear division between code-mixing and borrowing/transference/interference, as well as to the use of the term ungrammatical for what is just a tendency.
To create annotation principles I had to decide on the distinction between terminology and the border lines of the terms myself. The decisions and the reasons for them will be discussed in the next part.

6. Annotation principles

Based on what we know about code-mixing now, I have compiled a list of things that should be annotated. This compilation is based on the assumption that only two languages are being switched in the text that is being annotated. This approach is chosen simply to simplify the explanation. With a few minor modifications the principles can be used for more languages used by the author (or speaker, if the corpus is recorded).

First of all, it is important to annotate every single word with its language. The language of a word can be determined with the use of a grammar dictionary; however, it is false to assume that we can determine the language by looking solely at morphology. The word can be a nonce-borrowing and have a stem of one language but the morphology of another. If one considers nonce-loans a part of the language whose morphology they acquired, then this point may not make much sense; however, it is important to make sure that all morphological markers belong to one language. Moreover, relying on the morphology of the word when determining its language only works under the assumption that the free-morpheme constraint is not violated. This, as we have seen, is not always the case. This work is based on the annotation format of UniParser (Arkhangelskiy et al. 2012), in which only separate words are annotated, but not sentences or clauses. Nonetheless, this should not stop us from marking phenomena that involve multiple elements. If the switch involves a few constituents, the first constituent should be marked as such, and the rest should be annotated as following that one. Describing which elements are mixed in should also make it possible to distinguish different directions of code-switching, for example, if one wants to search separately for alternation switches from L1 to L2 and from L2 to L1.

When annotating insertion, a few strategies can be chosen, depending on the understanding of what the term stands for.
For this work the approach was chosen based on our capabilities of automatic annotation. It is, however, also one of the most popular approaches today. We consider as insertion any occurrence of one or more consecutive words of a language different from the matrix language which is inserted inside the sentence, meaning that it is neither at the beginning nor at the end of that sentence, when there is only a single insertion in the sentence (otherwise it would qualify as congruent lexicalization). However, if the inserted word or words exist in both languages, then it should be considered borrowing rather than insertion. The tag marking the length of the switched segment should also be placed on the first word, so that someone who is not interested in single-word insertions can disregard them. As for the matrix language, determining it is only needed for annotating insertion; we have decided to choose the left-to-right approach. It seems to be the most suitable strategy in our situation, as the first word of the sentence in our annotation is always in the matrix language if the sentence contains an insertion. When an element is inserted at the beginning of the sentence, it can only be considered alternation or congruent lexicalization. Thus, if the inserted elements get the insertion marker, the matrix language is naturally the opposite one.

Moreover, when a word is in L1 but has L2 morphological markers within an L2 context, it should also be considered borrowing (one can also add a special marker for such cases). However, if the context is mixed (the preceding word in L1 and the following one in L2), it should be considered a code-switch in the middle of the word and therefore a violation of the free-morpheme constraint. Such an example, although found ungrammatical, was given in (Budzhak-Jones 1995), where she studied English/Ukrainian code-mixing:

(2) *So you go to a storu des iskupytysja

[…]-M.Gen somewhere to-shop-Refl

Alternation is much less controversial and easier to mark. Each word of L2 should get an alternation marker, and the first word of the segment should also get the first marker. If the switch occurs on a word that is homonymous in the two languages, that word should be marked as a trigger.

When marking congruent lexicalization, we check whether there are several insertions of another language in the sentence and then annotate every word with the congruent lexicalization marker (in our case cong.lex). As with other types of code-mixing, we mark the first word in the sentence as such. This is also essential for future search queries: if, for instance, the user wants to find all occurrences of congruent lexicalization, the engine will only search for and show the first word, to avoid repetition.

In addition, someone might want to study code-mixing in regard to its syntactic constraints. To be able to do so, some annotation of syntax is necessary. As the equivalence constraint is still a rather controversial topic, it is something that a corpus might help with (as we discussed it through the notion of government). If both languages have syntactic chunkers (shallow parsers, i.e. tools that identify NPs and other kinds of constituents), two annotations can be offered, so that not only the observance of the equivalence constraint in general can be checked, but more specific hypotheses as well, for instance, the potential examples offered in (Muysken 2000).

a. V(Eng) NP(Sp) - 2

b. V(SP) a NP(Eng) - 1

c. *V(Eng) a NP(Sp) - 3

d. *V(Sp) NP(Eng) - 0

However, a chunker is not always available or easy to make. This does not mean that without it no syntactic constraints and theories can be checked. An easy solution is to look just at part of speech and grammatical characteristics:

[…] (Sp) + Noun (Eng + Acc).

This type of search can also reveal what code-mixing patterns are like when the languages have different sets of values within one grammatical category (for instance, 9 Russian cases vs 15 Udmurt cases).
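The rules for telling insertion, alternation and congruent lexicalization apart can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the annotation code used for the corpus: every word is assumed to carry exactly one of two language labels already (borrowings, homonyms and triggers are not handled), the matrix language is taken left-to-right from the first word, and the tag strings other than cong.lex (first, len=N) are hypothetical names chosen for the example.

```python
from itertools import groupby

def classify_sentence(langs):
    """Classify a sentence given one language label per word.

    Assumes exactly two languages, so consecutive runs alternate.
    Returns (label, runs), where runs is a list of (language, length).
    """
    runs = [(lang, len(list(group))) for lang, group in groupby(langs)]
    if len(runs) <= 1:
        label = "none"          # monolingual (or empty) sentence
    elif len(runs) == 2:
        label = "alternation"   # one switch, sentence ends in the other language
    elif len(runs) == 3:
        label = "insertion"     # single embedded segment inside the sentence
    else:
        label = "cong.lex"      # several embedded segments
    return label, runs

def annotate(langs):
    """Return one tag string per word (empty string = no code-mixing tag)."""
    label, runs = classify_sentence(langs)
    if label == "none":
        return [""] * len(langs)
    if label == "cong.lex":
        # every word is tagged; only the first word of the sentence is "first"
        return ["cong.lex,first"] + ["cong.lex"] * (len(langs) - 1)
    matrix = langs[0]  # left-to-right choice of the matrix language
    tags = []
    for lang, length in runs:
        for position in range(length):
            if lang == matrix:
                tags.append("")
            elif position == 0:
                # hypothetical tag layout: type, first marker, segment length
                tags.append(f"{label},first,len={length}")
            else:
                tags.append(label)
    return tags

print(annotate(["udm", "udm", "rus", "udm"]))
# ['', '', 'insertion,first,len=1', '']
```

Putting the segment length on the first word of the switch mirrors the idea that a query can then filter out, say, all single-word insertions by looking at one token only.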

7. Udmurt/Russian Code-Mixing Corpus

7.1 Why Udmurt/Russian?

After working out the principles, I started creating a prototype of a corpus to test them. This research initially grew out of work on an online annotated corpus of the Udmurt language that Timofey Arkhangelskiy and I have been building. While collecting texts and annotating them with morphological marking, we constantly came across Russian words and phrases inserted into Udmurt sentences, which could not be annotated with the use of just an Udmurt parser. In fact, in some texts the mixing occurred so often that we had to remove them from the corpus, as they only polluted the data by being left unannotated. That project was aimed at researching standard Udmurt and its dialects, so we did not try to work around code-mixing. However, we realized that it is a relatively common situation, especially when annotating blogs and social network pages, as people tend to use constructions, phrases and grammar of spoken language and more informal language in general, which includes switching between languages. All research on this topic, however, turned out to have been conducted with the help of manually collected (mostly recorded) and manually annotated corpora. Therefore, having all this invaluable internet data, we decided that there is an urgent need to create a way to annotate switches. We wanted to be able both to mark foreign words in relatively clean data and to work on code-mixing as a separate area of linguistics. So, out of pure enthusiasm for solving a problem, combined with the absence of any research on the topic in regard to these particular languages, we started working on a corpus of Udmurt/Russian code-switching.

7.2 General Remarks

Udmurt is a Finno-Ugric language of the Uralic family. It is spoken by around 340,000 people in Russia according to the census of 2010. Udmurt, together with Russian, is an official language of the Republic of Udmurtia. It is also rather widespread in the Republics of Tatarstan, Bashkortostan and Mari El, as well as in the Kirov, Perm and Sverdlovsk Regions of Russia. Udmurt has four major dialects: Northern, Beserman, Southern, and South-central. Furthermore, there are other transitional and mixed dialects. The differences between the dialects are not very substantial and are mostly phonological rather than morphological (Alatyrev 1983).

Udmurt writing was created in the 18th century on the basis of Russian graphics. However, the Udmurt alphabet only took shape at the beginning of the 20th century (therefore there are not many texts written before the middle of the 20th century). The alphabet consists of 38 letters, 33 of which are Russian, while the remaining 5 have diacritics: Ӵӵ, Ӟӟ, Ӝӝ, Ӧӧ, Ӥӥ.

As Udmurt is mostly located on the territory of Russia, it is obvious that it has been under a great amount of influence, due to language contact, lack of prestige and a history of suppression. Therefore, most people who speak Udmurt are bilingual. There has been a very recent rise in support for the Udmurt language. A few electronic dictionaries have been compiled and a few Udmurt books reissued, but there are still not many resources for studying Udmurt with a linguistic approach.

There are a few grammar books describing the Udmurt language, as well as printed and electronic dictionaries (based on the printed copies). There is also a Corpus of the Udmurt language created at the University of Helsinki (Suihkonen 1998); it, however, has restricted access. A lot of the work on the present corpus is based on the work that we did on the Udmurt corpus. Both that corpus and the one created during this work were built by adapting the search system of the Eastern Armenian National Corpus (EANC).

[…]

(21) лыдӟылытэмъяськылыны

лыдӟ - ыл -ыт-эмъяськ- ыл - ыны

read FREQ CAUS FICT FREQ INF

to often pretend that somebody is making someone read often

Udmurt syntax is similar to Russian syntax but not entirely the same; for instance, Udmurt has many postpositions, whereas in Russian the same function is normally taken up by prepositions. As for the lexicon, the dictionary that I had to work with (Kirillova 2008) contains an enormous number of Russian words, most of which are indeed established loans. This, however, for obvious reasons causes trouble for the automatic annotation of the language of each word.

7.3 What Can We Expect?

There are a few things that we can suppose about Udmurt/Russian code-switching just by analyzing the grammar, the historical and geographical situation and what we know about code-mixing by now. First of all, we can assume that different authors prefer different code-mixing patterns, although within what this particular language pair can offer. Insertion is usually the most common type in general, but we might have problems distinguishing between loans and code-mixing due to the long language contact. Russian and Udmurt have very similar word order, and therefore there is a possibility of congruent lexicalization. As we have collected blogs that are mostly in Udmurt, we expect most alternations to be switches from Udmurt to Russian, but a few cases the other way around should also occur. We will be checking both the free-morpheme and the equivalence constraints; however, we expect them to be upheld.

7.4 Corpus Contents

The texts used for the corpus are the ones available online; these are mostly blogs and posts from social networks. Such texts are usually informal by nature. They represent a practically unique type of data, merging characteristics of both written and oral speech. Being already written down, they offer a perfect opportunity for automatic language processing. In terms of code-mixing, they can push forward research in sociolinguistics and psycholinguistics, as well as synchronic and diachronic studies. A few studies of code-mixing on Dutch-Moroccan and Dutch-Turkish internet sites have already been conducted by (Dorleijn, Nortier 2009), and they have shown how much the patterns that appear in different language pairs differ. They state, however, that those differences are mostly due to the sociolinguistic situations as opposed to typological considerations.

However, there is no real evidence that there is no drastic difference from a typological perspective between some types of language pairings. It is clear that there are some differences, but nobody has been able to compile a big enough and diverse enough typological sample to analyze them properly.

The overall framework, the pipeline and the tools that can be used to build corpora and produce morphological annotation have been developed in the course of the Corpus Linguistics fundamental research programme of the Russian Academy of Sciences. Within this programme, corpora of Buryat, Kalmyk, Lezgian, Ossetic, Tatar and other languages of Russia have been developed (see (Arkhangelskiy 2012)). Common standards developed for these corpora include automated full morphological tagging, annotating the texts with metadata (including author, title, genre, year of creation, etc.), and providing the corpus with a publicly available online interface. The ideology and some of the tools used while building the corpora in focus originated in the project of the Eastern Armenian National Corpus (Daniel 2009). For a detailed description of the methods and tools by the example of the Ossetic National Corpus, see (Arkhangelskiy et al. 2012).

The Udmurt/Russian corpus uses the search platform initially designed for the Eastern Armenian National Corpus and then adapted for use with numerous other corpora. The platform allows the users to make complex queries, including searching for certain lemmata, grammatical tags, punctuation surrounding the words, combinations of words and more. The results can be sorted in random order or according to parameters like text title, token (in alphabetic order), etc.

For each instance found in the corpus, the user can see the single sentence where that instance was found. The result may be expanded to a maximum of a 3-sentence context. This constraint solves the dilemma of copyrighted materials in an open access corpus: while any particular sentence can be found in any text, there is no way the user can read or extract the whole text or any significant part of it from the corpus.

After collecting the texts, we had to create a grammatical dictionary, which is complicated by the fact that Udmurt parts of speech are often hard to distinguish. As I had to use a dictionary where parts of speech were not marked, the decision on what part of speech a word belongs to was often made on the basis of the Russian translation. As the complexity of automatic annotation increased drastically, so did the number of mistakes. Some words, however, had their parts of speech indicated in the dictionary (most particles and many conjunctions), and the verbs were easy to identify due to their unique infinitive suffix -ны. Some lexemes, though, had to be removed from the dictionary completely: it contained many participles and gerunds, which are derived from the verb with very productive suffixes, and these forms can be generated automatically.

A dictionary and a grammar in a particular form are needed for the morphological parser. Automatic morphological annotation for Udmurt was conducted with the use of data from the grammars (Alatyrev 1983), (Perevoshchikov 1962), (Winkler 2001) and the dictionaries (Butolina 1942), (Kirillova 2008), in the form required by UniParser, as we did previously for the Corpus of Udmurt Language.

Each dictionary entry records the lexeme's grammatical characteristics, inflectional paradigm and translation. The English translation is only present in this paper for illustrative reasons; the corpus itself only contains translations into Russian.

-lexeme

lex: веднаськыны

stem: веднаськ.

gramm: V,I

paradigm: connect_verbs-1

trans_ru: заниматься колдовством to practice witchcraft

lang: udm

The grammar dictionary, whatever it is based on, should be compiled in such a way that it does not contain (as far as possible) forms that can be generated. Therefore, entries such as the following should be eliminated:

-lexeme

lex: веднаськытыны

stem: веднаськыт.

gramm: V,I

paradigm: connect_verbs-1

trans_ru: заставить заниматься колдовством to make someone practice witchcraft

lang: udm

The translations should be as short and as accurate as possible, so that they fit on the screen and let the user glance through them quickly. If it is impossible to shorten a translation properly by automatic means (e.g. due to many different meanings), even a simple restriction on the number of characters is better than long strings of words.

Along with working on the dictionary, we have to create paradigms for the inflections. By combining stems with the corresponding paradigms we generate all possible forms that the word can have. Each cell of the paradigm has the following form:

-paradigm: Verb-pres-I-positive

flex: .э

gramm: 3,sg,pres,I

Then these new forms are searched for in the texts, and when they are found, the words get the respective annotations. This means that if, with the purpose of simplifying the paradigms, we generate some surplus forms that do not actually exist, they simply will not be found in the texts. It is important, however, to check that these forms are not homonymous with anything else, because in that case the paradigm has to be changed.

Unfortunately, this is often the case: many Russian forms of the same word are homonymous with each other, Udmurt forms have homonyms within Udmurt, and, most unfortunately, many Udmurt words are homonymous with Russian words. Thus some words, when we cannot easily resolve the homonymy automatically through syntax, get several annotations and, in the code-mixing corpus, naturally two language markers. However, in code-mixing the homonymy of words in two different languages may become a trigger for a switch.

Below I list some basic Udmurt and Russian linguistic characteristics, as some of them might be useful in regard to equivalence. Udmurt, as opposed to Russian, is an agglutinative language with mostly postpositive agglutination. However, there are some inflectional elements. Udmurt verbs take one of 4 tenses and 4 aspects, are plural or singular, and appear in one of 3 persons; they can be transitive or intransitive and have 2 conjugation types. Verbs have negative and positive forms. Nouns can be singular or plural, in one of 15 cases as opposed to 10 in Russian (National Corpus of Russian Language). As in Russian, it is possible to generate gerunds and adverbial participles.
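The step of combining stems with paradigm cells to generate word forms can be sketched as follows. This is a minimal illustration and not UniParser itself: it assumes that the dot in stem and flex simply marks the point where the two are joined, the helper names are hypothetical, and the pairing of the stem веднаськ. with the Verb-pres-I-positive cell is toy data for the example rather than a claim about the real paradigm of this verb.

```python
def generate_forms(lexeme, paradigms):
    """Combine one lexeme's stem with every cell of its paradigm.

    Returns a dict mapping each generated form to a list of analyses.
    Assumption: '.' only marks the join point between stem and flex.
    """
    stem = lexeme["stem"].rstrip(".")
    forms = {}
    for flex, gramm in paradigms[lexeme["paradigm"]]:
        form = stem + flex.lstrip(".")
        analysis = {
            "lex": lexeme["lex"],
            "gramm": lexeme["gramm"] + "," + gramm,
            "lang": lexeme["lang"],
        }
        forms.setdefault(form, []).append(analysis)
    return forms

def build_index(lexemes, paradigms):
    """Merge the forms of all lexemes (of both languages) into one lookup table.

    Homonymous forms end up with several analyses under the same key.
    """
    index = {}
    for lexeme in lexemes:
        for form, analyses in generate_forms(lexeme, paradigms).items():
            index.setdefault(form, []).extend(analyses)
    return index

def languages_of(word, index):
    """Return the sorted language markers a word receives (may be 0, 1 or 2)."""
    return sorted({analysis["lang"] for analysis in index.get(word.lower(), [])})

# Toy data (illustrative only)
paradigms = {"Verb-pres-I-positive": [(".э", "3,sg,pres,I")]}
lexemes = [{
    "lex": "веднаськыны",
    "stem": "веднаськ.",
    "gramm": "V,I",
    "paradigm": "Verb-pres-I-positive",
    "lang": "udm",
}]
index = build_index(lexemes, paradigms)
print(languages_of("веднаськэ", index))  # ['udm']
```

A word whose form is generated from both an Udmurt and a Russian lexeme would come back with two language markers, which is the situation that can later be flagged as a potential trigger.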

7.5 Annotation of Code-Mixing in the Corpus

According to the principles that have been discussed, I have annotated insertion, congruent lexicalization and alternation. Some decisions have to be made for such an annotation in any language pair, such as the distinction between insertions and borrowings and whether a one-word switch at the end of the sentence should be considered alternation. But there are also some language-specific decisions that have to be made. First of all, there is a lot of homonymy during annotation; when it is significantly unbalanced in frequency, for instance when in one language the word is a very common pronoun and in the other it is a noun that is rarely used both in general topics and in the particular corpus, removing the latter from the dictionary actually improves the accuracy and functionality of the corpus. Another modification to the annotation that I had to make specifically for Udmurt/Russian concerns the construction Russian infinitive + карыны (udm. to_do), which is fairly common and certainly productive; it is therefore annotated as a construction borrowed into Udmurt.

7.5.2 Insertion

The corpus contains a very large number of insertions. One of the most frequently inserted elements is the Russian conjunction и 'and' between clauses.

(22) Вуэ но тани со дорам куное и кутске ни аслаз мудрон кылъёсыныз мыным мадьыны.

There are also many interjections (23) and Russian idioms (24). One might argue that the latter are used for lack of a similar expression in Udmurt. Not being a native Udmurt speaker, I cannot make this claim; nevertheless, various psycholinguistic studies suggest that this is often the case.

(23) Пыдйылам султыса гинэ мон шоди - туннэ мӥ вордиськом... аххааа, ну ти монэ валады, может öд но валалэ…

(24) Атае третий десяток пошёл шуыса шоккетӥз.

My original hypothesis was that insertional code-mixing would be the most common type. Although I did come across it rather often, it is sometimes very hard to distinguish between borrowing and insertion. The Udmurt lexicon is so full of Russian words that it can be hard to see whether a loan is established or whether it has been brought into the sentence just for this particular occasion. The strategy that I have chosen for working around this problem, as described earlier, is to check whether the word exists in both dictionaries. Some of the Russian loans included in the Udmurt dictionary, however, have Udmurt analogues that are much more widely used, which means that some of them should probably not be used for the annotation. The other major problem is that some Russian and Udmurt words are homonymous in forms that are not grammatically identical, which might lead to wrong output. I have tried to resolve some of these cases, but not all of them can be resolved automatically through verb government or word order; consequently, some mistakes may still occur. They should, however, be relatively easy to recognize manually.
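The dictionary check used to separate borrowings from insertions can be sketched as follows. The function name, the labels and the `excluded_loans` parameter are assumptions made for illustration, and the word sets are toy dictionaries rather than the real ones.

```python
def classify_item(lemma, udm_dict, rus_dict, excluded_loans=frozenset()):
    """Classify a lemma found in an Udmurt context.

    A word listed in both dictionaries is treated as an established
    borrowing, unless it is in `excluded_loans`, a hypothetical list of
    loans whose Udmurt analogues are much more widely used.
    """
    in_udm = lemma in udm_dict and lemma not in excluded_loans
    in_rus = lemma in rus_dict
    if in_udm and in_rus:
        return "borrowing"   # established loan: present in both dictionaries
    if in_rus:
        return "insertion"   # Russian-only item brought into the sentence
    if in_udm:
        return "native"
    return "unknown"


# Toy dictionaries for illustration only.
udm_dict = {"школа", "гурт"}
rus_dict = {"школа", "окно"}

for word in ["школа", "окно", "гурт", "абв"]:
    print(word, classify_item(word, udm_dict, rus_dict))
```

Passing a loan in `excluded_loans` moves it from "borrowing" to "insertion", which corresponds to the suggestion above that some dictionary loans should probably not be used for the annotation.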

7.5.3 Alternation

If a sentence starts in one language, switches into another at some point (once) and then finishes in that other language, it receives an alternational annotation. This also applies to switches of just one word if that word is at the beginning or at the very end of the sentence. The first word after the switch gets a mark indicating that it is the first one, together with the number of words in this language in the sentence. If the first word is homonymous in Udmurt and Russian, it gets a trigger mark as well. This allows the user to find all the alternational switches and to narrow the search down to longer or shorter ones if needed. As I have discussed before, there are two types of alternation: central and peripheral. My annotation does not allow searching for them separately (at least not as of today), although a restriction on length may narrow the results down somewhat. Here are examples of both from the corpus (chosen manually). Interestingly, although central alternation is in general more common in code-mixing, Udmurt/Russian seems to show an overwhelming predominance of peripheral alternation.

Alternation in the corpus:

(25) Нырысетӥ 200 страница вал напряжённой, интригующий женский роман.

The first 200 pages were an intense, intriguing women's novel.

The word напряженной exists in both Russian and Udmurt. Grammatically it is Udmurt here, as this is one of the cases when phonetically homonymous words in the two languages do not have the same morphological characteristics; however, I believe that in this case напряженной could still be a trigger for the switch.

Alternation in the corpus:

(26) А мы вообще не парились, но чай сектам.

And we didn't bother at all and treated them to tea.

As in the previous example, there is a word, чай 'tea', that exists in both languages, but grammatically we can assume that it is Udmurt here. Но also exists in both languages, and in both it can mean 'but'. Even if we decide that it is an Udmurt word, we should still consider it a trigger because of its homonymy with the Russian conjunction.
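The alternation rule described in this section (exactly one switch per sentence, a mark on the first word after the switch, the length of the switched segment, and a trigger mark for homonymous words) can be sketched as below. The field names `cm`, `first_after_switch`, `segment_length` and `trigger` are hypothetical, and the language labels for example (26) are assigned by hand following the discussion above.

```python
def annotate_alternation(tokens):
    """Mark the first word after a single language switch in a sentence.

    `tokens` is a list of dicts with "form", "lang" (resolved language)
    and "homonym" (True if the form exists in both languages).
    Sentences with no switch or with more than one switch are returned
    unmarked, since they do not count as alternation here.
    """
    langs = [t["lang"] for t in tokens]
    switches = [i for i in range(1, len(langs)) if langs[i] != langs[i - 1]]
    out = [dict(t) for t in tokens]
    if len(switches) != 1:
        return out
    start = switches[0]
    out[start]["cm"] = "alternation"
    out[start]["first_after_switch"] = True
    # Number of words in the second language, counted to the sentence end.
    out[start]["segment_length"] = len(tokens) - start
    if tokens[start].get("homonym"):
        out[start]["trigger"] = True
    return out


def tok(form, lang, homonym=False):
    return {"form": form, "lang": lang, "homonym": homonym}


# Example (26), with punctuation removed.
sentence = [
    tok("А", "rus"), tok("мы", "rus"), tok("вообще", "rus"),
    tok("не", "rus"), tok("парились", "rus"),
    tok("но", "udm", homonym=True), tok("чай", "udm", homonym=True),
    tok("сектам", "udm"),
]
marked = annotate_alternation(sentence)
print(marked[5])
```

Storing the segment length on the first switched word is what makes it possible to restrict a search to longer or shorter alternations.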

7.5.4 Congruent lexicalization

Much more common than alternation in the Udmurt/Russian corpus is congruent lexicalization.

(27) И мыным тунсыко потылэ вал котькуд гужем, день военно-морского флота соос эшъёсыныз люкаськыса празновать карыло вал шуыса и вообще.

I was interested to go out that summer, on the day of the Navy forces, they came together with their friends to celebrate and everything.

Example (27) qualifies as congruent lexicalization; note, however, that праздновать карыло, as I have discussed before, is marked as a borrowing because of the regularity of this formation (Russian infinitive + Udmurt 'to do') in Udmurt.

A very interesting example is (28), where the reader can see how often the author switches back and forth.

(28) Окно - со стекло прозрачное, адӟиськод, мар луэ со сьӧрын, а чтобы лэсьтыны сое зеркало и чтобы адӟыны астэ гинэ и не замечать, мар луэ вокруг стеклоез покрытьтоно сереброен.

A window is a transparent glass, you can see through it what's going on, and if you want to make a mirror out of it, to see just yourself, and not what's around, the glass has to be covered with silver.

Покрытьтоно is an interesting borrowing: the causative suffix -тоно is attached to the Russian infinitive покрыть 'to cover'.

7.6 Checking the Constraints

7.6.1 Equivalence Constraint

I have checked every sentence from the examples for violations of the equivalence constraint, and none of them seems to show any deviation. The exception is the relatively unusual word order in (28), which is not typical for either Russian or Udmurt, although it is grammatical in both: стекло прозрачное has Noun + Adj order, in contrast to the usual Adj + Noun. Where an alternation in the corpus happens on a trigger word, a violation of the equivalence constraint is less probable. The similar syntax of the two languages does not leave many opportunities for it either. Therefore, all the examples that I managed to check turned out to be equivalent. This, however, does not mean that the equivalence constraint is never violated in Russian/Udmurt code-mixing discourse, but rather that more precise research is needed to establish whether it is or not.

7.6.2 Free-Morpheme Constraint

When analyzing Udmurt/Russian code-mixing, it is often hard to distinguish between code-mixing and nonce loans, since the languages have been in such close contact for such a long time.

(29) В то время адямиос сыыче богатствоен нокинэ но не замечают солэн совесть не позволит бомжэн вераськыны таиз дась вераськыны но.

Interestingly, example (29) might be a case of violation according to the principles that we have worked out for annotation on the basis of (Budzhak-Jones 1995). The sentence starts in Russian, then one of the words gets Udmurt morphology, becoming a nonce loan; but as the sentence continues in Udmurt, this amounts to code-switching within one word.
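A search for such word-internal switches could look roughly like the sketch below. It is an assumption-laden illustration, not the procedure used in the corpus: the token fields, the hand-made segmentation of the words and the notion of an `established_loans` set are hypothetical, and the code assumes suffixing morphology (stem on the left, affix on the right).

```python
def edge_langs(tok):
    """Return (language at the left edge, language at the right edge)."""
    left = tok["stem_lang"]
    right = tok.get("affix_lang") or left
    return left, right


def free_morpheme_candidates(tokens, established_loans=frozenset()):
    """Indices of tokens that may violate the free-morpheme constraint.

    A token is a candidate if its stem and affix come from different
    languages, the stem is not an established loan, the preceding word
    ends in the stem's language and the following word starts in the
    affix's language, so the switch falls inside the word.
    """
    hits = []
    for i, tok in enumerate(tokens):
        left, right = edge_langs(tok)
        if left == right or tok["stem"] in established_loans:
            continue
        prev_ok = i > 0 and edge_langs(tokens[i - 1])[1] == left
        next_ok = i + 1 < len(tokens) and edge_langs(tokens[i + 1])[0] == right
        if prev_ok and next_ok:
            hits.append(i)
    return hits


# Fragment of example (29); the stems are segmented by hand for illustration.
fragment_29 = [
    {"form": "позволит", "stem": "позвол", "stem_lang": "rus"},
    {"form": "бомжэн", "stem": "бомж", "stem_lang": "rus", "affix_lang": "udm"},
    {"form": "вераськыны", "stem": "вераськ", "stem_lang": "udm"},
]
print(free_morpheme_candidates(fragment_29))
```

The function only returns candidates. Whether a given hit is a real violation or an integrated loan still has to be decided manually, as in the discussion of (29).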

7.7 Future Improvements

Although we have built a corpus according to the suggested annotation principles, there is always room for improvement. The priority for this Udmurt/Russian code-mixing corpus should be cleaning and expanding the Udmurt grammar dictionary.

As has been discussed, the equivalence constraint might be enough for Udmurt/Russian code-switching because of the similar syntax of the two languages, but creating a chunker for Udmurt (a few Russian chunkers are already available) would potentially unify the process of code-mixing analysis.

There are ways to resolve homonymy through syntax, some of which we have implemented; however, there is still work to be done in this direction, and a chunker might also be helpful with regard to this problem.

In addition, one of the permanent tasks is, of course, expanding the corpus by adding new texts to it.

8. Further Work

One of the main goals of this work was to create a unified system of code-mixing annotation, so that, after combining it with morphological annotation, a whole set of corpora could be built.

Multiple corpora of language pairs with various morphology, syntax and word order, as well as of languages from the same family as opposed to languages from different ones, will make it possible to see a much wider picture of why and where code-mixing occurs in speech, to analyze existing constraint hypotheses and suggest new ones, and therefore to work towards an extensive description of code-mixing in general. They will make it possible to study code-mixing under different circumstances: for instance, Spanish/English (both SVO languages) versus Irish/English (VSO/SVO) versus Turkish/Dutch (SOV/V2); or Basque/Spanish (ergative/accusative) versus Udmurt/Russian (both accusative); or Ngen/French (agglutinative/fusional) versus Baoule/Ngen (agglutinative/agglutinative), etc. Creating a whole set of corpora will make it possible to work on the typology of code-mixing all around the world.

In the course of this work I have developed the principles that a corpus of texts containing code-mixing should follow and built a working prototype of an Udmurt/Russian Code-Mixing Corpus on the basis of the Online Annotated Corpus of Udmurt Language, created in 2014. I have discussed different approaches to studying code-mixing and various classifications of code-mixing by different scholars, eventually choosing the types that are most generally accepted: insertion, alternation and congruent lexicalization. I have analysed most of the constraints that have been proposed for code-mixing in the last 65 years, both the ones that are claimed to be universal and the language-specific ones. I have described a way to annotate multilingual texts so as to ease the verification of the equivalence constraint, the government constraint and the free-morpheme constraint, as well as some of the language-specific constraints, although their implementation depends on the particular language pair. I have tried to create the most flexible annotation rules possible, so that they can be adapted to various language pairs. Although the most traditional theories were preferred in designing these methods, my main goal is to find a balance between the existing theories and what can be done automatically, in order to create the most functional system possible.

This work was the first step towards creating a whole set of corpora built on such rules, in order to increase the speed and accuracy of research on code-mixing, help check the existing theories, offer new ones, and make it possible to work with more specific examples and conduct more subtle research. A set of corpora with both morphological and code-mixing annotation has the potential to give a strong start to typological studies of this phenomenon, as a result of significantly easier access to data for analysis. It will make it possible to move forward in finding answers to questions regularly raised by linguists researching the reasons for code-mixing, such as whether there are some sorts of constituents in discourse which can be switched and others which cannot, whether there are some constituents which tend to be switched into one language rather than the other, in what ways incorporated items combine with the rest of the discourse, and many others. I have outlined some possible characteristics of such a set and built the first automatically annotated corpus of Udmurt/Russian that has both morphological and code-mixing annotation. All files related to it, including the formatted dictionary, grammatical paradigms and other links, can be found at: https://github.com/masha-medvedeva/UdmurtRussianCorpus

Acknowledgments

Special thanks to my supervisor Timophey Arkhangelskiy for helping me with this work every step of the way and to Nikolay Vakhtin for inspiring us both to work on this topic. Huge thanks to Michael Daniel for supplying me with endless materials on bilingualism and code-mixing. I am endlessly grateful to all our Udmurt informants, who helped us search for texts when we first started working on the Udmurt Corpus and supported our project with the greatest enthusiasm, and to everyone who has been so kind as to provide feedback.

References

Alvarez-Caccamo, Celso (1998). From switching code to code-switching: Towards a reconceptualisation of communicative codes. In P. Auer (ed.), Code-switching in conversation: Language, interaction and identity, pp. 29-50. London and New York: Routledge.

Arkhangelskiy, Timofey (2012). Electronic Corpora of the Albanian, Kalmyk, Lezgian, and Ossetic Languages // Automatic Documentation and Mathematical Linguistics, Vol. 46, No. 2, pp. 118-123. Allerton Press.

Auer, Peter (1984). Bilingual conversation. Amsterdam and Philadelphia: John Benjamins.

Auer, Peter (1988). A conversation analytic approach to code-switching and transfer. In M. Heller (ed.), Codeswitching: Anthropological and sociolinguistic perspectives, pp. 187-214. Berlin and New York: Mouton de Gruyter.

Auer, Peter (1995). The pragmatics of code-switching: A sequential approach. In L. Milroy and P. Muysken (eds.), One speaker, two languages: Cross-disciplinary perspectives on code-switching, pp. 115-135. Cambridge, UK and New York: Cambridge University Press.

Auer, Peter (ed.) (1998). Code-switching in conversation: Language, interaction and identity. London and New York: Routledge.

Auer, Peter (1999). From codeswitching via language mixing to fused lects: Toward a dynamic typology of bilingual speech. International Journal of Bilingualism, 3 (4), 309-332.

Auer, Peter (2000). Why should we and how can we determine the base language of a bilingual conversation? Estudios de Sociolingüística, 1 (1), 129-144.

Auer, Peter (2005). A postscript: Code-switching and social identity. Journal of Pragmatics. Special Issue: Conversational Code-Switching, 37 (3), 403-410.

Backus, Ad (1992). Patterns of language mixing: A study in Turkish-Dutch bilingualism. Wiesbaden: Harrassowitz.

Backus, Ad (2003). Units in code switching: Evidence for multimorphemic elements in the lexicon. Linguistics, 41 (1), 83-132.

Backus, Ad (2005). Codeswitching and language change: One thing leads to another? International Journal of Bilingualism, 9 (3-4), 307-340.

Bentahila, A. (1983a). Language attitudes among Arabic-French bilinguals in Morocco. Clevedon, Avon: Multilingual Matters.

Bentahila, A. (1983b). Motivations for code-switching among Arabic-French bilinguals in Morocco. Language and Communication 3: 233-43.

Bentahila, A., and Davies, Eileen D. (1983). The syntax of Arabic-French code-switching. Lingua 59: 301-30.

Bentahila, A., and Davies, Eileen D. (1991). Constraints on code-switching: a look beyond grammar. In Papers for the Symposium on Code-Switching and Bilingual Studies: Theory, Significance and Perspective, Barcelona, pp. 396-404. Strasbourg: ESF.

Berk-Seligson, S. (1986). Linguistic constraints on intra-sentential code-switching: a study of Spanish/Hebrew bilingualism. Language in Society 15: 313-48.

Berruto, G. (2005). Italiano parlato e comunicazione mediata dal computer. In Hölker K., Maaß Ch. (eds.), Aspetti dell'italiano parlato.

Clyne, Michael (1967). Transference and Triggering. The Hague: Nijhoff.

Clyne, Michael (1972). Perspectives on Language Contact. Melbourne: Hawthorn Press.

Clyne, Michael (1980). Triggering and language processing. Canadian Journal of Psychology, 34: 400-6.

Clyne, Michael (1987). Constraints on code switching: How universal are they? Linguistics, 25 (4), 739-764.

Clyne, Michael G. (2003). Dynamics of language contact: English and immigrant languages. Cambridge, UK and New York: Cambridge University Press.

Di Sciullo, A., Muysken, P., and Singh, R. (1986). Government and code-mixing. Journal of Linguistics 22: 1-24.

Dorleijn, Margreet and Jacomine Nortier (2009). Code-switching and the internet. In Barbara Bullock and Almeida Jacqueline Toribio (eds.), The Cambridge handbook of linguistic code-switching, pp. 127-141. New York: Cambridge University Press.

Eliasson, Stig (1989). English-Maori language contact: Code-switching and the free-morpheme constraint. Reports from Uppsala University Department of Linguistics, 18, 1-28.

Fano, R. M. (1950). The information theory point of view in speech communication. Journal of the Acoustical Society of America 22.6.

Finlayson, R., Calteaux, K., & Myers-Scotton, C. (1998). Orderly mixing and accommodation in South African codeswitching. Journal of Sociolinguistics, 2 (3).

Gardner-Chloros, Penelope (1991). Language selection and switching in Strasbourg. Oxford and New York: Oxford University Press.

Golovko 2001 - Головко Е.В. Переключение кодов или новый код? // Европейский университет в Санкт-Петербурге. Труды факультета этнологии. Вып. 1, СПб., 2001. - С. 298-316.

Grosjean, F. (1982). Life with two languages: an introduction to bilingualism. Cambridge: Harvard University Press.

Gullberg, M., Indefrey, P., Muysken, P. (2009). Research techniques for the study of code-switching. In Bullock, B.E. and Toribio, A.J. (eds.), The Cambridge Handbook on Linguistic Code-Switching. Cambridge University Press.

Gumperz, John J. (1964). Hindi-Punjabi code-switching in Delhi. In H. Lunt (ed.), Proceedings of the Ninth International Congress of Linguistics, Cambridge, Massachusetts, 1962, pp. 1115-1124. The Hague: Mouton.

Gumperz, J.J. (1982). Discourse strategies. Cambridge: Cambridge University Press.

Haugen, Einar (1950a). The analysis of linguistic borrowing. Language 26.2, 210-231.

Haugen, Einar (1950b). Problems of bilingualism. Lingua 2.3, 271-290.

Heller, Monica (ed.) (1988). Codeswitching: Anthropological and sociolinguistic perspectives. Berlin/New York/Amsterdam: Mouton de Gruyter.

Heller, M. (1992). The politics of code-switching and language choice. Journal of multilingual and multicultural development 13, 123-42.

van Hout, Roeland, Muysken, Pieter (1994). Modelling Lexical Borrowability. Language Variation and Change 6.

Jakobson, Roman, C. Gunnar M. Fant, and Morris Halle (1952). Preliminaries to speech analysis: The distinctive features and their correlates. Cambridge (Mass.): The M.I.T. Press.

Jakobson, Roman (1961). Linguistics and communication theory. In Roman Jakobson (ed.), On the structure of language and its mathematical aspects. Proceedings of the XIIth Symposium of Applied Mathematics [New York, 14-15 April 1960], pp. 245-252. Providence (R.I.): American Mathematical Society.

Joshi, Aravind K. (1985a). How much context-sensitivity is necessary for assigning structural descriptions? Tree adjoining grammars. In D. R. Dowty, L. Karttunen and A. M. Zwicky (eds.), Natural language parsing: Psychological, computational, and theoretical perspectives, pp. 206-250. Cambridge, UK and New York: Cambridge University Press.

Joshi, Aravind K. (1985b). Processing of sentences with intrasentential code switching. In D. R. Dowty, L. Karttunen and A. M. Zwicky (eds.), Natural language parsing: Psychological, computational, and theoretical perspectives, pp. 190-205. Cambridge, UK and New York: Cambridge University Press.

Kolers, Paul A. (1966). Reading and talking bilingually. American Journal of Psychology, 79 (3), 357-376.

Lehtinen, M. K. T. (1966). An analysis of a Finnish-English bilingual corpus. Doctoral dissertation. Indiana University, Bloomington.

Lipski, J. M. (1978). Code-switching and bilingual competence. In M. Paradis (ed.), Fourth LACUS Forum, pp. 263-277. Columbia, S.C.: Hornbeam Press.

MacSwan, Jeff (1999a). A minimalist approach to intrasentential code switching. New York: Garland.

MacSwan, Jeff (1999b). A minimalist approach to intrasentential code switching: Spanish-Nahuatl bilingualism in Central Mexico. London and New York: Routledge.

Maschler, Yael (1998). On the transition from code-switching to a mixed code. In P. Auer (ed.), Code-Switching in Conversation, pp. 125-149. London: Routledge.

Milroy, Lesley (1980). Language and social networks. Baltimore: University Park Press.

Milroy, L., & Muysken, P. (eds.) (1994). One speaker, two languages: Cross-disciplinary perspectives on code-switching. Cambridge, England: Cambridge University Press.

Muysken, Pieter (1981). Halfway between Quechua and Spanish: The case for relexification. In A. R. Highfield and A. Valdman (eds.), Historicity and variation in creole studies, pp. 52-78. Ann Arbor: Karoma.

Muysken, Pieter (1988). Media Lengua and linguistic theory. The Canadian Journal of Linguistics/La Revue canadienne de Linguistique, 33 (4), 409-422.

Muysken, Pieter (1996). Media Lengua. In S. G. Thomason (ed.), Contact languages: A wider perspective, pp. 365-426. Amsterdam and Philadelphia: John Benjamins.

Muysken, Pieter (2000). Bilingual speech: A typology of code-mixing. Cambridge, UK and New York: Cambridge University Press.

Muysken, Pieter (2005). Two languages in two countries: The use of Spanish and Quechua in songs and poems from Peru and Ecuador. In G. Delgado and J. M. Schechter (eds.), Quechua verbal artistry: The inscription of Andean voices [Arte expresivo Quechua: la inscripción de voces andinas], pp. 35-60. Bonn: Bonner Amerikanistische Studien.

Muysken, Pieter; Kook, Hetty and Vedder, Paul (1996). Papiamento/Dutch code-switching in bilingual parent-child reading. Applied Psycholinguistics, 17 (4), 485-505.

Myers-Scotton, C. (1988). Code-switching and types of multilingual communities. In P. Lowenberg (ed.), Language Spread and Language Policy, pp. 61-82. Washington, D.C.: Georgetown University Press.

Myers-Scotton, C. (1989). Code-Switching with English: Types of switching, types of communities. World Englishes, 8: 333-46.

Myers-Scotton, C. (1993). Duelling Languages: grammatical structure in codeswitching. Oxford: Clarendon Press.

Myers-Scotton, C., Jake, Janice L. and Okasha, M. (1996). Arabic and constraints on codeswitching. In Mushira Eid and Dilworth Parkinson (eds.), Perspectives on Arabic Linguistics IX, pp. 9-43. Amsterdam: Benjamins.

Nait M'Barek, M., and Sankoff, D. (1988). Le discours mixte arabe/français: emprunts ou alternances de langue? Canadian Journal of Linguistics 33 (2), 143-154.

Nortier, J. (1989). Dutch and Moroccan Arabic in contact: code-switching among Moroccans in the Netherlands. Unpublished Ph.D. thesis, University of Amsterdam.

Nortier, J. (1990a). Dutch-Moroccan Arabic Code-Switching among Moroccans in the Netherlands. Dordrecht: Foris.

Nortier, J. (1990b). Code-switching and borrowing. Paper presented at the Workshop on Ethnic Minority Languages, Gilze-Rijen.

Nortier, J. (1995). Code-switching in Moroccan Arabic/Dutch versus Moroccan Arabic/French language contact. International Journal of the Sociology of Language, Vol. 112: 81-95.

Nortier, J., and Schatz, H. (1992). From one-word switch to loan: a comparison between five language pairs. Multilingua 11: 173-94.

Pfaff, Carol W. (1976). Functional and structural constraints on syntactic variation in code-switching. In B. Steever et al. (eds.), Papers from the Parasession on diachronic syntax, pp. 248-59. Chicago: Chicago Linguistic Society.

Pfaff, Carol W. (1979). Constraints on language mixing. Language 55: 291-318.

Poplack, S. (1980). Sometimes I'll start a sentence in Spanish y termino en español. Linguistics 18: 581-618.

Poplack, S. (1981). Syntactic structure and social function. In R. P. Duran (ed.), Latino language and communicative behavior, pp. 169-84. Norwood, N.J.: Ablex.

Said, J. (1988). Codemixing and multilingual competence in Morocco. Paper presented at the Second Dutch-Moroccan Symposium, Leiden-Amsterdam, April.

Sankoff, D., and Poplack, S. (1981). A formal grammar for code-switching. Papers in Linguistics: International Journal of Human Communication 14 (1): 3-45.

Suihkonen, Pirkko (1998). Documentation of the Computer Corpora of Uralic Languages at the University of Helsinki. Technical Reports TR-2. Department of General Linguistics, University of Helsinki.

Timm, L. (1975). Spanish-English code-switching: el porque y how-not-to. Romance Philology 28: 473-482.

Treffers-Daller, Jeanine (1994). Mixing two languages: French-Dutch contact in a comparative perspective. Berlin: Mouton de Gruyter.

Trudgill, P. (1986). Dialects in contact. Oxford: Blackwell.

Sridhar, S.N., Sridhar, K. K. (1980). The syntax and psycholinguistics of bilingual code-mixing. In Studies in the Linguistic Sciences, 10, 203-215.

Swigart, Leigh (1992). Two codes or one? The insider's view and the description of codeswitching in Dakar. In Eastman 1992, 83-102.

Vogt, Hans (1954). Language contacts. Word 10.2-3, 365-374.

Vakhtin, Golovko (2004) - Н. Б. Вахтин, Е. В. Головко. Социолингвистика и социология языка. СПб., 2004. - 336 с.

Wentz, James and McClure, Erica (1977). Aspects of the syntax of the code-switched discourse of bilingual children. In F. Ingemann (ed.), 1975 Mid-America Linguistics Conference papers. Lawrence, KS: University of Kansas.

Winkler, E. (2001). Udmurt. Languages of the World/Materials 212. München: LINCOM EUROPA.

Woolford, Ellen (1983). Bilingual code-switching and syntactic theory. Linguistic Inquiry, 14 (3), 520-536.

Алатырев (1983). Краткий грамматический очерк удмуртского языка. Ижевск, 1983.

Бутолина (1942). Русско-удмуртский словарь. Ижевск, 1942.

Перевощиков (1962). Грамматика современного удмуртского языка. Фонетика и морфология. Ижевск, 1962.

Кириллова (2008). Удмуртско-русский словарь. Ижевск, 1962.

The Corpus of Udmurt Language.
