COMPSCI 5096 TEXT AS DATA 2022

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

TEXT AS DATA M

COMPSCI 5096

Tuesday 3 May 2022

1. Question on Distributional Semantics and Word Embedding. (Total marks: 18)

Consider the problem of finding an outlier word from among a list of otherwise similar words. For example, in the following set of words:

linux, windows, solaris, android, java

the word java is an outlier (because the other words are names of operating systems).

Given a list of such words your task is to automatically infer the outlier word. With respect to this task, answer the following questions.

(a)  An approach to solving the word intrusion problem is to represent words as vectors and then use the relative distances/similarities between the vectors to find the outlier. Assuming you know (as the output of some process) the vector for each word w, describe the pseudo-code for finding the outlier word.

Task: Describe the pseudo-code of this algorithm. Clearly state your assumptions and introduce your notation.  [5]
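One possible reading of this task can be sketched as follows: score each word by its mean cosine similarity to every other word in the list, and report the word with the lowest score. This is only an illustrative sketch. The function names and the two-dimensional toy vectors are invented for the example, and other scoring rules (for instance, distance to the centroid) would be equally defensible.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_outlier(vectors):
    """Return the word whose mean similarity to all other words is lowest.

    `vectors` maps each word to its vector (assumed to be given).
    """
    def mean_similarity(word):
        others = [w for w in vectors if w != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)

    return min(vectors, key=mean_similarity)

# Toy 2-d vectors, invented purely for illustration.
toy_vectors = {
    "linux": [0.9, 0.1],
    "windows": [0.8, 0.2],
    "solaris": [0.85, 0.15],
    "android": [0.7, 0.3],
    "java": [0.1, 0.9],
}
outlier = find_outlier(toy_vectors)
```

The sketch computes all pairwise similarities, so it is quadratic in the list length, which is unlikely to matter for short word lists.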

One solution to the word outlier detection problem that does not require learning any parameters (via gradient descent) is the distributional semantics vector approach, where each word is represented via a bag-of-words vector of contexts. Now, answer the following:

(b)  The window size, k, used to define the contexts for each word is an important parameter of this approach. What happens if k is too large or too small?  [2]

(c)  Describe the pseudo-code of this approach that requires only a single pass through a collection (clearly describe the data structures for an efficient solution).  [5]
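A minimal sketch of a single-pass co-occurrence counter is given below, assuming a symmetric window of k tokens on each side and a pre-tokenized collection held in a list. The function name, the nested-dictionary layout, and the toy sentence are assumptions made for illustration, not part of the question.

```python
from collections import Counter, defaultdict

def build_context_vectors(tokens, k):
    """One pass over `tokens`; counts[w][c] = times c occurs within k positions of w."""
    counts = defaultdict(Counter)  # sparse vectors: word -> {context word -> count}
    for i, word in enumerate(tokens):
        # Look back at most k tokens and update both directions, so each
        # pair inside the window is recorded once for each of its members.
        for j in range(max(0, i - k), i):
            left = tokens[j]
            counts[word][left] += 1
            counts[left][word] += 1
    return counts

# Toy collection, invented for illustration.
tokens = "linux is an operating system and windows is an operating system".split()
vectors = build_context_vectors(tokens, k=2)
```

Storing each vector as a sparse counter keeps memory proportional to the number of observed pairs rather than to the square of the vocabulary size.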

(d)  Discuss (with an example) why the vectors of function words (frequent words, such as ‘of’, ‘the’, etc.) obtained with this approach are not reliable.  [2]

Now consider word2vec, which is a noise contrastive estimation based method that learns the vectors for each word. With respect to word2vec, answer the following questions.

(e)  What is the role of negative samples in the objective function of word2vec?  [2]
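For reference, the skip-gram objective with negative sampling for a single (word, context) pair is commonly written in the form below. The notation is an assumption chosen for this note and does not come from the exam: v_w and v_c are the word and context vectors, σ is the logistic function, K is the number of negative samples, and P_n is the noise distribution from which the negative words n_i are drawn.

```latex
\ell(w, c) = \log \sigma\!\left(\mathbf{v}_c^{\top}\mathbf{v}_w\right)
  + \sum_{i=1}^{K} \mathbb{E}_{n_i \sim P_n}
    \left[\log \sigma\!\left(-\mathbf{v}_{n_i}^{\top}\mathbf{v}_w\right)\right]
```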

(f)  Comment on word2vec’s output for a word with multiple meanings, such as jaguar, bank or python. What would you expect to find as the nearest neighbors of such polysemous words? What is the problem if you use the vectors of such words for another task, such as text classification?  [2]

2. Question on word frequencies and language model (Total marks: 15)

An alien probe crashes to Earth containing a short passage of alien text. The alien text uses a five-letter alphabet: [a, b, c, d, e] with no punctuation or spaces. Below is a short section of the text:

abcaedabccbaedabceda

(a)  Using character n-grams, write out all of the trigrams that appear more than once with their frequency for the sample text above.  [3]

Example Answer:

trigram    frequency
abc        3
eda        3
aed        2
dab        2
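The counts in the example answer can be checked with a short script. The sketch below slides a window of length n over the sample text and keeps the n-grams seen more than once; the function name is invented for illustration.

```python
from collections import Counter

def repeated_ngrams(text, n):
    """Count overlapping character n-grams and keep those seen more than once."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: c for gram, c in counts.items() if c > 1}

sample = "abcaedabccbaedabceda"
repeated = repeated_ngrams(sample, 3)
```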

(b)  Provide the theoretical maximum number of character n-grams for the alien probe full text for n = 1, 2, 3, 4 and 5. The full text found in the probe is 593 characters long.  [3]

Example Answer:

n    max n-grams
1    5
2    25
3    125
4    590
5    589
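The example answer is consistent with taking the smaller of two limits: the number of distinct strings of length n over a five-letter alphabet, and the number of window positions in a text of 593 characters. A minimal sketch of that reasoning, with a hypothetical function name, is:

```python
def max_distinct_ngrams(alphabet_size, text_length, n):
    """Upper bound on distinct n-grams: limited by both vocabulary and positions."""
    possible_strings = alphabet_size ** n      # every string of length n over the alphabet
    available_positions = text_length - n + 1  # overlapping windows in the text
    return min(possible_strings, max(available_positions, 0))

bounds = {n: max_distinct_ngrams(5, 593, n) for n in range(1, 6)}
```

For n up to 3 the alphabet is the binding limit (5, 25, 125); from n = 4 onward the text length is (593 - 4 + 1 = 590, then 589).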

(c)  A linguist makes a breakthrough in understanding the tokens used in the alien text.  She provides two possible ways to tokenize the sample text.

(i)  In plain English, explain a single rule that could reproduce this first tokenization:

a    bca    eda    bccba    eda    bceda

[1]

Example Answer: Start a new token whenever the previous character is ‘a’.
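The rule in the example answer can be tested mechanically. The sketch below closes a token immediately after each occurrence of a chosen boundary character; the function name is hypothetical.

```python
def tokenize_after(text, boundary_char):
    """Split `text` so that each token ends right after `boundary_char`."""
    tokens, current = [], ""
    for ch in text:
        current += ch
        if ch == boundary_char:
            tokens.append(current)
            current = ""
    if current:  # trailing characters that never reached a boundary
        tokens.append(current)
    return tokens

first_tokenization = tokenize_after("abcaedabccbaedabceda", "a")
```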

(ii)  In plain English, explain a single rule that could reproduce this second tokenization:

ab    caedab    ccbaedab    ceda

[1]

(d)  More alien probes crash-land in different parts of the world. Scientists want to measure the similarity between the texts found in each probe. Here are two tokenized probe text fragments.

Probe Text A:

a    eda    bceda    eda    bcda    bce

Probe Text B:

ca    eda    bcba    eda    bceda    eda    bce

(i)  Calculate the Sørensen–Dice Coefficient and Jaccard Similarity between the two probe texts. Show your work.

[4]
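The two measures can be sketched as below. This assumes each probe text is treated as a set of unique tokens, which is one interpretation among several; if the texts are instead treated as bags (multisets) of tokens, repeated tokens such as eda would be counted more than once and the values would differ. The function names are invented for illustration.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over sets of unique tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dice(a, b):
    """2|A ∩ B| / (|A| + |B|) over sets of unique tokens."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

# Token sequences copied from Probe Text A and Probe Text B.
probe_a = "a eda bceda eda bcda bce".split()
probe_b = "ca eda bcba eda bceda eda bce".split()
jaccard_ab = jaccard(probe_a, probe_b)
dice_ab = dice(probe_a, probe_b)
```

Both measures share the same numerator idea (the overlap), so on any pair of sets the Dice value is at least as large as the Jaccard value.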

(ii)  Calculate the similarity between the third probe text (Probe Text C below) and the two prior probe texts using the Sørensen–Dice Coefficient. Using these results, show that the Sørensen–Dice Coefficient is a semi-metric as it breaks the triangle inequality.

Probe Text C:

beda    bceda    bceca    ebeda    bceda    b

[3]

3. This question is about Natural Language Processing (Total marks: 18)

You just landed an awesome job at the Intellectual Property Office. As your first project, you have been tasked with automatically classifying submitted patent applications into one of the eight broad International Patent Classification sections, as shown here:

(a)  You start by applying a typical pre-processing pipeline that consists of case normalisation and a stemmer. Within the context of the patent classification application, clearly justify these two pre-processing stages and provide an example that shows why they could lead to improved classification performance.  [4]
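To make the two stages concrete, the sketch below lowercases the text and applies a deliberately crude suffix stripper. The stripper is a toy stand-in written for this illustration; it is not the Porter algorithm or any library's stemmer, and a real pipeline would use an established implementation. The example words are invented.

```python
def crude_stem(token):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so very short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, split on whitespace, then stem each token."""
    return [crude_stem(tok) for tok in text.lower().split()]

features = preprocess("Heating heated HEATS heat")
```

All four surface forms collapse to a single feature, so a classifier sees one well-populated feature rather than four sparse ones.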

(b)  You recall from Text as Data that NLP features, such as parts of speech, are often helpful for classification tasks. Within the context of patent classification, provide and justify a specific example where considering a word along with its part-of-speech may help distinguish between two of the above sections.  [3]

(c)  Armed with the above intuition, you select an off-the-shelf part-of-speech tagger (based on a Hidden Markov Model) that reports 97% accuracy and apply it to some sample patents to ensure that it produces reasonable part-of-speech tags. To your dismay, you find that it frequently makes mistakes. On closer inspection, you observe that the errors are usually on specialised, domain-specific language in the patents. Explain why this problem arises and what you could do to fix it.  [4]

(d)  You want to identify whether two systems (called System A and System B) are better than a baseline method at the classification task. The following table shows intrinsic evaluation metrics obtained over the classification on the train and test sets:

 

              Train Set               Test Set
              Precision    Recall     Precision    Recall
Baseline      0.61         0.42       0.56         0.50
System A      0.62*        0.43*      0.58*        0.51
System B      0.67*        0.48*      0.51         0.42

* statistical significance w.r.t. baseline (t-test with p-value < 0.05)

Discuss the effectiveness (e.g., generalizability, overfit/underfit, performance on training/test sets, etc.) of the models A and B in comparison to the ‘Baseline’ method.  [3]
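One way to compare the systems on a single number per split is to combine precision and recall into an F1 score (their harmonic mean). The question does not ask for F1, so treating it as the summary measure is an assumption of this sketch; the precision and recall values are copied from the table above.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above.
results = {
    "Baseline": {"train": (0.61, 0.42), "test": (0.56, 0.50)},
    "System A": {"train": (0.62, 0.43), "test": (0.58, 0.51)},
    "System B": {"train": (0.67, 0.48), "test": (0.51, 0.42)},
}
f1_scores = {
    system: {split: round(f1(*pr), 3) for split, pr in splits.items()}
    for system, splits in results.items()
}
```

Comparing each system's train and test F1 side by side makes any gap between the two splits easy to see, though it says nothing about statistical significance.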

(e)  Meanwhile, another team has been busy building a BERT-based text classifier, and they have found that it also works well on the task. You decide to join forces with them. Without using an ensemble approach, how might you go about including explicit parts-of-speech into their BERT-based model? How is the technique different from the approach you took in your linear bag-of-features model?  [4]