Pipeline of natural language processing
1) Text processing
2) Feature extraction
3) Modeling
Text processing
Feature extraction
Modeling
How to read a file in Python
with open("hola.txt", "r") as f:
    text = f.read()
How to read tabular data or csv
- df = pd.read_csv("hola.csv")  # requires: import pandas as pd
How to fetch a web page or a file from the web?
import requests
# Fetch a web page
r = requests.get("https://www.udacity.com/courses/all")
How to clean the text from a website?
from bs4 import BeautifulSoup
# Remove HTML tags using the Beautiful Soup library
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())
Tips for text cleaning
- For document classification or clustering, remove punctuation
How to eliminate punctuation
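One common way to do this is a regular expression that replaces every character that is not a letter, digit, or whitespace with a space. The sketch below uses only the standard library; the function name remove_punctuation and the ASCII-only character class are assumptions made for illustration, and text with accented characters would need a wider pattern.

```python
import re

def remove_punctuation(text: str) -> str:
    """Replace anything that is not an ASCII letter, digit, or whitespace with a space."""
    return re.sub(r"[^a-zA-Z0-9\s]", " ", text)

cleaned = remove_punctuation("Hello, world! It's 2024.")
print(cleaned.split())
```

Replacing with a space rather than an empty string keeps words such as "end.Start" from being glued together.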
Useful libraries
What is a token
Tokenization with NLTK
Stop word removal
Part-of-speech tagging
Named entity recognition
Stemming and Lemmatization
Lesson summary
1) Normalize
2) Tokenize
3) Remove stop words
4) Stem / Lemmatize
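The four steps above can be chained into one small function. This is a minimal standard-library sketch, not a production pipeline: the stop word list is a tiny hypothetical sample, and naive_stem is a crude suffix stripper standing in for a real stemmer or lemmatizer (for example the ones NLTK provides).

```python
import re

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def normalize(text):
    # 1) Normalize: lowercase and replace punctuation with spaces.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def tokenize(text):
    # 2) Tokenize: split on whitespace.
    return text.split()

def remove_stop_words(tokens):
    # 3) Remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    # 4) Stem: crude suffix stripping, an assumption standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = remove_stop_words(tokenize(normalize(text)))
    return [naive_stem(t) for t in tokens]

result = preprocess("The cats are jumping in the garden.")
print(result)
```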
Bag of words
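A bag of words represents each document as a vector of word counts over a shared vocabulary, ignoring word order. The sketch below builds such vectors with the standard library; the two example documents and the helper name bag_of_words are made up for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    # Shared vocabulary, sorted so that vector positions are stable.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)
print(vectors)
```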
TF-IDF
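- tfidf = tf * idf = count(d, t) / |d| * log(|D| / |{d ∈ D : t ∈ d}|)
- Here count(d, t) is the number of times term t occurs in document d, |d| is the number of terms in d, and D is the set of all documents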
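The formula above can be computed directly. The sketch below is a plain standard-library illustration using the natural logarithm; the function names and the toy corpus are assumptions, and returning 0.0 for a term that appears in no document is a choice made here to avoid division by zero, not part of the formula.

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term divided by document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (number of documents / documents containing the term).
    n_containing = sum(1 for d in docs if term in d)
    if n_containing == 0:
        return 0.0  # assumption: unseen terms get weight 0
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy corpus of already tokenized documents (made up for illustration).
corpus = [["the", "cat", "sat"], ["the", "dog", "barked"]]
print(tfidf("the", corpus[0], corpus))
print(tfidf("cat", corpus[0], corpus))
```

A word that occurs in every document, such as "the" here, gets an idf of log(1) = 0, so its score is zero regardless of how often it appears.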
Word2Vec
GloVe
Language model
A language model captures the distributional statistics of words. In its most basic form, we take each unique word in a corpus, i, and count how many times it occurs.
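As a rough illustration of this most basic form, the counts can be turned into relative frequencies. The toy corpus and variable names below are assumptions made for the example.

```python
from collections import Counter

# Toy corpus, already tokenized on whitespace (made up for illustration).
corpus = "the cat sat on the mat".split()

# Count how many times each unique word occurs.
unigram_counts = Counter(corpus)

# Relative frequency of each word: count divided by total number of tokens.
total = sum(unigram_counts.values())
unigram_probs = {word: count / total for word, count in unigram_counts.items()}

print(unigram_counts["the"], unigram_probs["the"])
```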
Bigram Model
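A bigram model estimates the probability of a word given the word before it. One simple estimate is count(previous, word) / count(previous). The sketch below assumes that maximum-likelihood estimate with no smoothing, on a made-up toy corpus; the function name bigram_prob is hypothetical.

```python
from collections import Counter

# Toy corpus, already tokenized on whitespace (made up for illustration).
tokens = "the cat sat on the mat".split()

# Count adjacent word pairs and single words.
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev), with no smoothing."""
    if unigram_counts[prev] == 0:
        return 0.0  # assumption: unseen context gets probability 0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))
print(bigram_prob("cat", "sat"))
```

Without smoothing, any pair that never occurs in the corpus gets probability zero, which is usually too harsh for real text.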