Lecture 9, continued: part two
These are the general issues with corpora
Corpus can't show what doesn't occur, or all that can occur
A limitation of corpora
Introspection may be surprised by what does occur
When you look closely, you may be surprised by what actually occurs
Areas of language that corpora don’t illumine
A limitation of corpora
The rest needs no explanation or translation: it is listed as points in the lecture and the lecturer added no comments
Size of corpus and individual word frequency. How big should it be?
Cost effectiveness - more running words don't give proportionally more different words
10-20 hours to process 2000 words of speech (prosodic tagging)
Just because a population is vast does not mean samples have to be vast to be representative, as some think… Depends on feature of interest and variability. Word frequency problem
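A minimal sketch of the cost-effectiveness point above: as the number of running words (tokens) grows, the number of different words (types) grows more slowly, so the type-token ratio falls. The toy text and the names `tokenize` and `vocabulary_growth` are made up for illustration; a real corpus would need proper tokenisation.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def vocabulary_growth(tokens, step):
    """Return (running words, different words) pairs, measured every `step` tokens."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points

# Hypothetical toy text, not real corpus data.
sample = ("the cat sat on the mat and the dog sat on the log "
          "and the cat saw the dog and the dog saw the cat")
tokens = tokenize(sample)

# Columns: running words, different words, type-token ratio.
for n_tokens, n_types in vocabulary_growth(tokens, step=6):
    print(n_tokens, n_types, round(n_types / n_tokens, 2))
```

On this toy text the count of different words stops rising while the running words keep increasing, which is the pattern the lecture point describes.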
Static or dynamic (monitor) corpora?
Static or dynamic examination of corpora
Sampling and how to be representative e.g. of general English? Any collection of texts is not a useful (principled) corpus. Problems…
Take a spread of samples from every city or region, and decide how much is to be collected
Opportunistic - biased to written, accessible varieties?
Systematic- balanced and representative: a corpus of corpora
Balance
Exclude non-standard?
For example, slang
What national varieties?
Many different dialects
How far back?
Do we include material from a year ago or from a week ago?
What proportions of varieties?
Speaker/writer factors as well (demographics)? Problem more with written than spoken (L1 from name?). Addressee
Information about the speaker (language level, age, background) helps us later
Then: Random selection?
A good choice, yes
Stratified sampling? What varieties?
Either we take a mixed selection or we need to specify categories
Weighting by how much read or by 'influence'? Expert judgment
A way of choosing, for example when adding a topic, by how influential the topic is
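A minimal sketch of stratified sampling with fixed proportions per variety, as raised in the questions above. The variety names, text identifiers and proportions are all hypothetical; choosing the real proportions is exactly the expert-judgment problem the lecture mentions.

```python
import random

def stratified_sample(texts_by_variety, proportions, total, seed=0):
    """Draw about `total` texts, giving each variety its stated share."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = {}
    for variety, share in proportions.items():
        pool = texts_by_variety[variety]
        k = min(len(pool), round(total * share))
        sample[variety] = rng.sample(pool, k)  # random selection within the stratum
    return sample

# Hypothetical text identifiers and proportions, for illustration only.
texts_by_variety = {
    "newspapers": [f"news_{i}" for i in range(50)],
    "fiction": [f"fic_{i}" for i in range(30)],
    "conversation": [f"conv_{i}" for i in range(20)],
}
proportions = {"newspapers": 0.5, "fiction": 0.3, "conversation": 0.2}

chosen = stratified_sample(texts_by_variety, proportions, total=10)
for variety, texts in chosen.items():
    print(variety, len(texts), texts)
```

Random selection still happens, but only inside each variety, so no variety can be left out by chance.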
Even genres like ‘academic writing’ are not homogeneous: they depend on subdiscipline (Business and Economics use ‘I’, Computing and Physics use ‘we’), genre within subdiscipline (review, report), even the lecturer being written for
Choosing samples for a particular subject
How to sample each text, and sample size again? Copyright issues
Get copyright permission for the data you select
Spoken? how natural are speeches, TV etc.?
Be specific when choosing what kind of speech to use
Fully natural: observer’s paradox and how to be ethical? Permission. Labov’s tricks
Records of speakers (and addressees and…)
For example, when you record someone's conversation you must get their permission, but if the person knows you are going to record them, they may become shy or hesitant, not speak freely, and not behave naturally
Transcription issues: what to transcribe and who does it (expert or not)
Transcribing: do you do it yourself or do you need an expert?
Random sampling again; problem of accents and dialects
This is not done by hand; the analysis needs an automatic system, for example a program
Analysis - how to extract useful information automatically?
frequency and its derivatives:
range: over text types
richness of vocab: TTR
collocational strength: MI and t-score/z-score
i.e. which word occurs with and fits another, for example:
handsome girl: wrong
handsome boy: correct
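A minimal sketch of one common formulation of MI and t-score, comparing the observed co-occurrence count O with the count E expected by chance, where E = f(node) x f(collocate) / N. All the counts below are invented to mirror the handsome boy / handsome girl example; they are not from a real corpus, and some tools also adjust E for the span size.

```python
import math

def expected(freq_node, freq_collocate, corpus_size):
    """Co-occurrences expected if the two words were independent."""
    return freq_node * freq_collocate / corpus_size

def mi_score(observed, freq_node, freq_collocate, corpus_size):
    """Mutual information: log2(O / E). Requires observed > 0."""
    return math.log2(observed / expected(freq_node, freq_collocate, corpus_size))

def t_score(observed, freq_node, freq_collocate, corpus_size):
    """t-score: (O - E) / sqrt(O). Requires observed > 0."""
    e = expected(freq_node, freq_collocate, corpus_size)
    return (observed - e) / math.sqrt(observed)

# Hypothetical counts in a 1,000,000-word corpus.
N = 1_000_000
F_HANDSOME, F_BOY, F_GIRL = 200, 5000, 5000
pairs = {"handsome boy": (20, F_BOY), "handsome girl": (1, F_GIRL)}

for pair, (obs, f_coll) in pairs.items():
    print(pair,
          round(mi_score(obs, F_HANDSOME, f_coll, N), 2),
          round(t_score(obs, F_HANDSOME, f_coll, N), 2))
```

With these made-up counts the second pair occurs exactly as often as chance predicts, so both of its scores are zero, while the first pair scores well above zero.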
how to relate go, goes and went? lemmatisation
lemmatisation as go goes went
derivation as inform informative information
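A minimal sketch of lemmatisation as a lookup, so that go, goes and went are counted together. The table `LEMMA_TABLE` is a hand-made toy, an assumption for illustration; real lemmatisers rely on large lexicons and rules.

```python
from collections import Counter

# Hypothetical hand-made lookup table covering one lemma only.
LEMMA_TABLE = {"go": "go", "goes": "go", "went": "go", "going": "go", "gone": "go"}

def lemmatise(token):
    """Map an inflected form to its lemma; unknown words are returned lowercased."""
    return LEMMA_TABLE.get(token.lower(), token.lower())

tokens = "She goes where he went and they go too".split()
lemma_counts = Counter(lemmatise(t) for t in tokens)
print(lemma_counts["go"])

# Derivation is a different relation: these stay as three separate lemmas.
print([lemmatise(w) for w in ["inform", "informative", "information"]])
```

Counting by lemma gives one frequency for the verb instead of three separate frequencies for its forms.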
concordance: the problem of large numbers. Qualitative into quantitative
results
qualitative: describes
quantitative: numbers
how to distinguish right from right: POS and other annotation/tagging
One word with several meanings
right = correct
right = a right (entitlement)
how to sort and select from a KWIC listing?
Nothing was said about this
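Since nothing was said about the KWIC question in the lecture, here is a minimal sketch of one possible approach, not the lecturer's: build key-word-in-context lines and sort them by the first word to the right of the node. The sentence and the function names are made up for illustration.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_by_right(lines):
    """Sort concordance lines alphabetically by the words to the right of the node."""
    return sorted(lines, key=lambda line: [w.lower() for w in line[2]])

# Hypothetical example text.
text = "turn right at the lights you are right about that it is my right to speak"
tokens = text.split()

for left, node, right in sort_by_right(kwic(tokens, "right")):
    print(f"{' '.join(left):>20}  {node}  {' '.join(right)}")
```

Sorting on the right context groups similar uses together, which makes it easier to select from a long listing; sorting on the left context works the same way with `line[0]`.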
Accessibility to general users – cost, computers etc.
Try their demo; if you want to use it, you have to buy it first
BNC= British National Corpus
-------------------------------------------------------------------
The above issues all repeat for learner corpora. Further, issues (see ICLE solutions):
What counts as a learner? Cf ICE
Information about learner language that is not reflected in a learner corpus
What counts as ‘authentic’ for learners?
Apart from L1, what variables would you want to have documented about the students and the tasks/setting for any collection of learner material in a corpus? (Cf Granger 2002 discussion) These all may make a difference
Problem therefore of comparability of such corpora collected by different people in different countries
Possibility of longitudinal corpora
Contrastive interlanguage analysis
NNS – NS To find errors and over/under use. But issues of:
Comparability of variety
Linguistic imperialism (terms like error, overuse), but problem of learners’ real wishes and lack of information on ‘international proficient speaker English’
NNS – NNS To distinguish transfer and non-transfer (e.g. developmental) errors.
Comparability again
Parallel L1 corpus of the learners would be useful
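A minimal sketch of the NNS – NS comparison for over/under use: compare how often a word occurs per 1,000 words in a learner sample and in a native sample. Both samples are invented toy sentences; a real study would need comparable corpora and a significance test.

```python
from collections import Counter

def per_thousand(tokens, word):
    """Relative frequency of `word` per 1,000 running words."""
    return 1000 * Counter(tokens)[word] / len(tokens)

def usage_ratio(learner_tokens, native_tokens, word):
    """Learner rate divided by native rate: above 1 suggests overuse, below 1 underuse.

    Returns None when the word never occurs in the native sample.
    """
    native_rate = per_thousand(native_tokens, word)
    if native_rate == 0:
        return None
    return per_thousand(learner_tokens, word) / native_rate

# Hypothetical toy samples, not real learner or native data.
learner = "i went to the market and i bought bread and i met my friend and we talked".split()
native = "i went to the market bought some bread and then met a friend for a chat".split()

print(round(usage_ratio(learner, native, "and"), 2))
```

The ratio only describes a difference in frequency; whether to call it overuse is the terminology issue raised above.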
Computerised error analysis
Method 1: Think of an error and search for it
Method 2: Tag all errors in corpus and then search
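A minimal sketch contrasting the two methods. The learner sentences and the `<err type="...">` tag format are invented for this example; real error-tagged corpora use their own annotation schemes.

```python
import re

# Hypothetical learner sentences, raw and error-tagged.
raw = ["He is agree with me.", "She agrees with him.", "They are agree on this."]
tagged = ['He <err type="verb">is agree</err> with me.',
          'I have <err type="article">a informations</err> for you.']

def search_suspected(sentences, pattern):
    """Method 1: search raw text for one error you already suspect."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [s for s in sentences if rx.search(s)]

def search_tagged(sentences, error_type=None):
    """Method 2: pull out every tagged error, optionally of one type only."""
    rx = re.compile(r'<err type="([^"]+)">(.*?)</err>')
    hits = []
    for s in sentences:
        for etype, span in rx.findall(s):
            if error_type is None or etype == error_type:
                hits.append((etype, span))
    return hits

print(search_suspected(raw, r"\b(is|are|am) agree\b"))
print(search_tagged(tagged))
print(search_tagged(tagged, "verb"))
```

Method 1 only finds what you thought to look for; Method 2 finds everything that was tagged, but someone has to do the tagging first.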
-------------------------------------------------------------------------
In short, what I understood:
NNS = non-native speaker
NS = native speaker
Issues related to learning
Examples from everyday life
Comparing native speakers with second-language speakers to find differences and errors in language use
For example, Arabic speakers repeat "and and and" (و و و)
Errors from transfer and translation between languages
We need software to give us an analysis of these errors
Method 1: if you want to analyse the language of Arabic speakers, they sometimes use a word wrongly, for example agenda
Method 2: with good software, one click may give us all the errors, which we can then search through
-------
Reason for edit: I tidied this up for you because it was a bit messy
end