What is the motivation for text mining?
e.g. web pages, emails, corporate documents, scientific papers
Define Text Mining
The extraction of implicit, previously unknown and potentially useful information from large amounts of textual resources
What are some application areas for Text Mining?
Give an example of a mixture of document clustering and classification
Google News first clusters news articles. Afterwards, it classifies the articles.
What is the goal of sentiment analysis?
For which area can you apply sentiment analysis?
On document level: analysis of a whole document (e.g. tweets about a president)
On feature/aspect level: analysis of product reviews (polarity values for different features within a review)
Explain search log mining
What are application areas for search log mining?
1) Search term auto-completion (association analysis)
2) Query topic detection (classification)
What is information extraction?
What are the subtasks of information extraction?
1) Named entity recognition (The parliament in Berlin …)
- Which parliament, which Berlin? Two named entities; use the context of the text to disambiguate them
2) Relationship extraction
- PERSON works for ORGANIZATION
3) Fact Extraction
- CITY has population NUMBER
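The relationship and fact templates above can be illustrated with a minimal pattern-matching sketch. The two regular expressions, the `extract` helper, and the sample sentence (including its names and numbers) are invented for illustration; a real system would typically run named entity recognition first and then decide which relation holds between the recognized entities.

```python
import re

# Hypothetical surface patterns for the two templates in the notes.
# They only match very regular sentences and are not a general solution.
WORKS_FOR = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) works for (?P<org>[A-Z][A-Za-z]+)"
)
POPULATION = re.compile(
    r"(?P<city>[A-Z][a-z]+) has a population of (?P<number>[\d,]+)"
)

def extract(text):
    """Return (relations, facts) found by the two toy patterns."""
    relations = [
        (m["person"], "works_for", m["org"]) for m in WORKS_FOR.finditer(text)
    ]
    facts = [
        (m["city"], "population", int(m["number"].replace(",", "")))
        for m in POPULATION.finditer(text)
    ]
    return relations, facts

# Made-up example sentence; the names and the number are not real data.
text = "Ada Lovelace works for Acme. Springfield has a population of 12,345."
relations, facts = extract(text)
```

Here `relations` holds PERSON works for ORGANIZATION triples and `facts` holds CITY has population NUMBER triples.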
What is the difference between Search/Query and Discovery?
Search/Query is goal-oriented: You know what you want
Discovery is opportunistic: You don’t know in advance which patterns you will identify in your data
Explain the text mining process
Which techniques are used for text preprocessing?
Which syntactic and linguistic text preprocessing techniques exist?
Explain Stopword Removal
You should remove stopwords:
What is Stemming?
Usefulness for Text Mining:
What are the basic stemming rules?
What are feature generation methods?
1) Bag-of-Words
2) Word Embeddings
Explain the Bag-of-words feature generation
Briefly explain the three different techniques for vector creation: Binary term occurrence, Term occurrence, and Term frequency
1) Binary term occurrence: Boolean attribute that describes whether or not a term appears in the document (no matter how often)
2) Term occurrence: Number of occurrences of a term in the document (problematic with texts of different lengths)
3) Term frequency: Relative frequency with which a term appears (number of occurrences / number of words in the document; suitable for documents of different lengths)
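The three vector creation techniques can be compared on a small example. This is a minimal sketch: the two documents, the whitespace tokenizer, and the helper names `tokenize` and `vectors` are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = ["the cat sat on the mat", "the dog barked"]

def tokenize(doc):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return doc.lower().split()

# The vocabulary defines the vector dimensions (one per distinct term).
vocabulary = sorted({t for d in docs for t in tokenize(d)})

def vectors(doc):
    """Return the binary, occurrence, and frequency vectors of one document."""
    tokens = tokenize(doc)
    counts = Counter(tokens)
    binary = [1 if counts[t] > 0 else 0 for t in vocabulary]
    occurrence = [counts[t] for t in vocabulary]
    # Dividing by document length makes documents of different lengths comparable.
    frequency = [counts[t] / len(tokens) for t in vocabulary]
    return binary, occurrence, frequency

binary, occurrence, frequency = vectors(docs[0])
```

For the first document, the term "the" gets 1 in the binary vector, 2 in the occurrence vector, and 2/6 in the frequency vector.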
Explain the TF-IDF term weighting scheme for feature generation (term frequency, inverse document frequency)
How does the TF-IDF distribute weights to words?
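As a rough illustration of how TF-IDF distributes weights, the sketch below uses one common variant: term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The toy corpus and the helper name `tf_idf` are assumptions; libraries often use different smoothing or logarithm conventions, so exact values will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the stock market fell",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(t for tokens in tokenized for t in set(tokens))

def tf_idf(tokens):
    """TF-IDF weights for one document: (count / length) * log(N / df)."""
    counts = Counter(tokens)
    return {
        t: (c / len(tokens)) * math.log(n_docs / df[t])
        for t, c in counts.items()
    }

weights = tf_idf(tokenized[0])
```

In this corpus, "the" occurs in every document and therefore gets weight 0, while "mat", which occurs in only one document, gets a higher weight than "cat", which occurs in two.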
Explain the feature generation method Word Embeddings
How can you conduct feature selection for text mining?