[NLP] Natural Language Processing (NLP) - Word Tokenization

This post introduces how to encode words as tokens so that a computer can process them.

Development Environment

Python 3.10.16

tensorflow 2.16.1

Tokenization

Tokenization encodes language as numbers: each distinct word is assigned an integer index.

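from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]

# num_words caps how many of the most frequent words are used when encoding
tokenizer = Tokenizer(num_words=100)
# Build the vocabulary from the sentences
tokenizer.fit_on_texts(sentences)
# Dictionary mapping each word to its integer index
word_index = tokenizer.word_index

print(word_index)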
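To make the idea of a word index concrete, here is a minimal pure-Python sketch of how such a mapping can be built. It is a simplification and not the Keras implementation: the helper name `build_word_index` and the regular expression are assumptions made for this illustration. Words are lowercased, punctuation is ignored, and more frequent words receive smaller indices.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Map each distinct word to an integer, most frequent word first (indices start at 1)."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep only runs of letters, digits, or apostrophes
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() keeps first-seen order among words with equal counts
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?',
]
word_index = build_word_index(sentences)
print(word_index)
# {'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}
```

Note that 'Today' and 'today?' both map to the same index, because case and punctuation are removed before counting.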

Sequences

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

# Replace every word in each sentence with its integer index
sequence = tokenizer.texts_to_sequences(sentences)
print(sequence)

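OOV Tokens

An out-of-vocabulary (OOV) token is a placeholder for words in new text that were not in the vocabulary the tokenizer was fitted on, so that unknown words are represented instead of being dropped.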
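The effect of an OOV token can be sketched in plain Python. This is only an illustration under stated assumptions: the vocabulary below, the `'<OOV>'` placeholder string, and the `encode` helper are all hypothetical. In Keras, the corresponding option is the `oov_token` argument of `Tokenizer`.

```python
OOV_TOKEN = '<OOV>'  # hypothetical placeholder name chosen for this sketch

# Assumed vocabulary; index 1 is reserved for the OOV placeholder
word_index = {OOV_TOKEN: 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6}

def encode(sentence, word_index, oov_token=None):
    """Convert a sentence to indices; unknown words become the OOV index or are dropped."""
    encoded = []
    for word in sentence.lower().split():
        if word in word_index:
            encoded.append(word_index[word])
        elif oov_token is not None:
            encoded.append(word_index[oov_token])
        # With no OOV token, the unknown word is silently skipped
    return encoded

new_sentence = 'Today is a snowy day'
without_oov = encode(new_sentence, word_index)
with_oov = encode(new_sentence, word_index, oov_token=OOV_TOKEN)
print(without_oov)  # [2, 3, 4, 6]
print(with_oov)     # [2, 3, 4, 1, 6]
```

Without the placeholder, the encoded sentence is shorter than the original and the position of the unknown word is lost. With it, the sequence keeps its length.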

Padding

Padding makes all input sequences the same length, which is required when training a neural network on batches of data.

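from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?',
    'I really like a snow day'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

sequence = tokenizer.texts_to_sequences(sentences)  # returns a list of lists
# print(sequence)

# Pad with zeros up to the length of the longest sequence
padded = pad_sequences(sequence)  # returns a NumPy array
print(padded)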

https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences


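Pre-padding and Post-padding

pad_sequences pads at the front of each sequence by default (padding='pre'). Passing padding='post' pads at the end instead: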

padded = pad_sequences(sequence, padding = 'post')
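The difference between the two padding modes is easiest to see side by side. The following is a minimal pure-Python sketch, not the library implementation: the `pad` helper and the example sequences are assumptions for illustration, and it returns plain lists rather than a NumPy array.

```python
def pad(sequences, padding='pre', value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (max_len - len(seq))
        # 'pre' puts the fill before the tokens, anything else puts it after
        padded.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)
    return padded

sequences = [[1, 2, 3, 4, 5], [2, 6, 4, 1]]
pre_padded = pad(sequences, padding='pre')
post_padded = pad(sequences, padding='post')
print(pre_padded)   # [[1, 2, 3, 4, 5], [0, 2, 6, 4, 1]]
print(post_padded)  # [[1, 2, 3, 4, 5], [2, 6, 4, 1, 0]]
```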

Stop Word Removal and Text Cleaning

Words that carry little meaning (stop words) are excluded before encoding.

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import nltk

# Download the stop word lists
nltk.download('stopwords')

# English stop words
stop_words = set(stopwords.words('english'))
print(stop_words)
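Once a stop word set is available, cleaning a text comes down to stripping markup and punctuation and filtering the remaining words. Below is a rough sketch using only the standard library. The tiny `STOP_WORDS` set, the `clean_text` helper, and the sample review are assumptions for illustration. The regular expression is a crude stand-in for a real HTML parser such as BeautifulSoup, and the NLTK list is much larger than the set used here.

```python
import re
import string

# Tiny illustrative stop word set; the NLTK English list is far larger
STOP_WORDS = {'a', 'an', 'the', 'is', 'it', 'this', 'and', 'of'}

def clean_text(text, stop_words=STOP_WORDS):
    """Strip HTML tags and punctuation, lowercase, and drop stop words."""
    text = re.sub(r'<[^>]+>', ' ', text)  # crude tag removal, assumes simple markup
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [w for w in text.lower().split() if w not in stop_words]
    return ' '.join(words)

review = 'This movie is <br /> a masterpiece, and it is the best of the year!'
cleaned = clean_text(review)
print(cleaned)  # movie masterpiece best year
```

The cleaned strings can then be passed to the tokenizer in place of the raw text.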

Example: loading and processing text from a TensorFlow Datasets dataset (imdb_reviews)

https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews


Loading the data

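import tensorflow as tf
import tensorflow_datasets as tfds

imdb_sentences = []
# Load the training split and iterate over it as NumPy values
train_data = tfds.load('imdb_reviews', split='train')
train_data = tfds.as_numpy(train_data)

for item in train_data:
    # item['text'] is a bytes object; decode it instead of calling str(),
    # which would leave a literal b'...' wrapper in the text
    imdb_sentences.append(item['text'].decode('utf-8'))

# num_words limits the indices used by texts_to_sequences to the most frequent words;
# word_index itself still lists every word seen
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=500)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)

print(tokenizer.word_index)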

Encoding

- Convert a sentence into a list of token indices

Decoding

- Convert the list of token indices back into a sentence
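The round trip between the two directions can be sketched with a pair of dictionaries. The vocabulary and the `encode` and `decode` helpers below are assumptions for illustration. In Keras, the fitted `Tokenizer` exposes a similar reverse mapping as `index_word`.

```python
word_index = {'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5}
# Invert the mapping so indices can be looked up as words
index_word = {index: word for word, index in word_index.items()}

def encode(sentence):
    """Sentence -> list of indices (unknown words are skipped)."""
    return [word_index[w] for w in sentence.lower().split() if w in word_index]

def decode(sequence):
    """List of indices -> sentence ('?' marks an unknown index)."""
    return ' '.join(index_word.get(i, '?') for i in sequence)

encoded = encode('Today is a sunny day')
decoded = decode(encoded)
print(encoded)  # [1, 2, 3, 4, 5]
print(decoded)  # today is a sunny day
```

Decoding does not restore the original capitalization or punctuation, since that information is discarded during encoding.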

Example: tokenization as performed by an actual GPT model

https://platform.openai.com/tokenizer
