[CodeReview] NLP: Sentence and Word Similarity Classification


https://github.com/kkobooc/NLP_KoreanHateSpeech

GitHub - kkobooc/NLP_KoreanHateSpeech: a Kaggle project that uses Korean natural language processing to classify comments on online entertainment news articles as hate/offensive/none according to their hatefulness and aggressiveness.

01_data_skimming (3)-checkpoint.ipynb

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd

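CountVectorizer: vectorizes a collection of documents into a count matrix of word occurrence frequencies.

By default it converts all text to lowercase, so "me" and "Me" become the same feature.

It converts each document into a list of tokens.
It counts how often each token occurs in each document.
It encodes each document as a BOW (bag-of-words) vector.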
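The three steps above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the helper names (`tokenize`, `build_vocabulary`, `to_bow`) and the two sample sentences are made up for this example. The token pattern (two or more word characters) is meant to resemble CountVectorizer's default.

```python
import re
from collections import Counter

def tokenize(doc):
    # Lowercase first, then keep runs of two or more word characters.
    return re.findall(r"\w\w+", doc.lower())

def build_vocabulary(docs):
    # Map every distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_bow(doc, vocabulary):
    # Count the tokens of one document and put each count in its column.
    counts = Counter(tokenize(doc))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

docs = ["Me and you", "me me YOU"]  # hypothetical sample documents
vocab = build_vocabulary(docs)
vectors = [to_bow(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Because of the lowercasing step, "Me", "me" and "YOU", "you" end up in the same columns.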

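TfidfVectorizer: similar to CountVectorizer above, but it builds BOW vectors whose word weights are adjusted with TF-IDF.

TF-IDF: rather than using raw counts, it reduces the weight of words that appear in nearly every document, on the grounds that such words do little to distinguish one document from another.

TF (Term Frequency): the number of times a term appears within a single document.

DF (Document Frequency): the number of documents in which a term appears.

IDF (Inverse Document Frequency): obtained by inverting DF, usually on a log scale.

TF-IDF: the product of TF and IDF. The value grows as TF gets higher and DF gets lower.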
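To make the definitions concrete, here is a small hand-rolled calculation. It assumes the smoothed IDF form ln((1 + n) / (1 + DF)) + 1, which is the formula scikit-learn documents for its default settings; the toy documents and function names are invented for this sketch. TfidfVectorizer additionally normalizes each row by default, which is left out here.

```python
import math

# Hypothetical, already tokenized documents.
docs = [
    ["good", "movie", "good"],
    ["bad", "movie"],
    ["good", "actor"],
]

def document_frequency(term, docs):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc)

def idf(term, docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    return math.log((1 + n) / (1 + document_frequency(term, docs))) + 1

def tf_idf(term, doc, docs):
    # Raw term count in one document times the corpus-wide IDF.
    return doc.count(term) * idf(term, docs)

print(round(idf("movie", docs), 4))   # appears in 2 of 3 documents
print(round(idf("actor", docs), 4))   # appears in 1 of 3 documents
print(round(tf_idf("good", docs[0], docs), 4))
```

The rarer term ("actor") gets the larger IDF, which is the "low DF means high weight" behavior described above.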

train = pd.read_csv('./datas/train_ver1', index_col=[0])

train.head()

Loads the train_ver1 file from the datas folder, using the first column (position 0) as the index.

train.head() returns the first five rows.

ct = CountVectorizer()

ct.fit(train['comments'])

print(ct.vocabulary_)

fit learns a vocabulary of every token in the comments column of train.

vocabulary_ is a Python dict that maps each term to its feature index.

sentence = [train['comments'][0]]

print(ct.transform(sentence).toarray())

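Transforms the first value of the comments column into a count vector and converts it to a dense array.

The printed array looks like all zeros. The vector has one column per vocabulary term and almost all of them are zero, and a long array printout is abbreviated, so any non-zero counts are easy to miss.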
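An all-zero printout is worth double-checking, since a comment that was part of the fitted corpus would normally have at least one non-zero count. A minimal sketch, using a hypothetical dense row rather than the notebook's data, of listing only the non-zero positions:

```python
def nonzero_entries(vector):
    # Return (index, value) pairs for the non-zero positions only.
    return [(i, v) for i, v in enumerate(vector) if v != 0]

# Hypothetical dense row: 10,000 columns, three of them non-zero.
row = [0] * 10_000
row[17] = 1
row[4_250] = 2
row[9_981] = 1

print(nonzero_entries(row))
print(sum(row))  # total number of counted tokens in this row
```

With the real vectorizer output, the sparse matrix returned by transform can be inspected in a similar spirit (for example through its nnz attribute) before calling toarray().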

tfidf = TfidfVectorizer()

tfidf.fit(train['comments'])

print(tfidf.vocabulary_)

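Creates a TfidfVectorizer and assigns it to tfidf.

The procedure is the same as above, but with adjusted weights. The text data is turned into a vocabulary,

the TF of each term is computed, the DF is computed across all comments and inverted to obtain the IDF, and the two are multiplied together.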

sentence = [train['comments'][1]]

print(tfidf.transform(sentence).toarray())

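Transforms the second value of the comments column into a TF-IDF vector and converts it to a dense array.

Again the printed array looks like all zeros, for the same reason: nearly every column is zero and the long printout is abbreviated.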

tfidf_matrix = tfidf.fit_transform(train['comments'])

idf = tfidf.idf_

print(dict(zip(tfidf.get_feature_names_out(), idf)))  # get_feature_names() was removed in newer scikit-learn releases

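This is groundwork for computing similarity: every word is extracted from the text and each comment is turned into a vector over those words. Here the vectors are TF-IDF weighted, and the printed dict shows the IDF learned for each term.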
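The notebook code shown here stops at the vectors and does not compute a similarity score itself. One common choice for comparing two such vectors is cosine similarity; the sketch below is an assumption about how that step could look, not the repository's method, and the three vectors are made-up values.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]  # hypothetical TF-IDF rows
b = [2.0, 4.0, 0.0]  # same direction as a
c = [0.0, 0.0, 3.0]  # shares no terms with a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Vectors pointing in the same direction score close to 1, and vectors with no terms in common score 0, regardless of comment length.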

reviews = list(train['comments'])

tokenized_reviews = [r.split() for r in reviews]

review_len = [len(t) for t in tokenized_reviews]

reviews holds the comments column of train converted to a list.

tokenized_reviews splits each comment on whitespace.

review_len loops over tokenized_reviews and records the number of tokens in each comment.
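A tiny example of the two length measures the notebook uses, word count here and syllable (eumjeol) count in the line that follows. The two sample comments are invented for illustration.

```python
reviews = ["정말 재미있는 영화", "별로"]  # hypothetical comments

# Word-level length: number of whitespace-separated tokens.
tokenized = [r.split() for r in reviews]
word_len = [len(t) for t in tokenized]

# Syllable-level length: number of characters once spaces are removed.
syllable_len = [len(r.replace(" ", "")) for r in reviews]

print(word_len)
print(syllable_len)
```

For Korean text each Hangul syllable block is one character, so the second measure counts syllables rather than letters.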

review_len_eumjeol = [len(s.replace(' ', '')) for s in reviews]

import matplotlib.pyplot as plt

plt.figure(figsize=(12,5))

plt.hist(review_len, bins=50, alpha=0.5, color='r', label='words')

plt.hist(review_len_eumjeol, bins=50, alpha=0.5, color='b', label='alphabet')

plt.yscale('log', nonpositive='clip')  # 'nonposy' was renamed to 'nonpositive' in newer matplotlib

plt.title('Comments Length Histogram')

plt.xlabel('Comments Length')

plt.ylabel('Number of Comments')

plt.show()

figsize sets the figure size.

The histogram of review_len (word counts) is drawn in red and that of review_len_eumjeol (syllable counts) in blue. The y-axis uses a log scale.

import numpy as np

print('문장 최대길이: {}'.format(np.max(review_len)))

print('문장 최소길이: {}'.format(np.min(review_len)))

print('문장 평균길이: {:.2f}'.format(np.mean(review_len)))

print('문장 길이 표준편차: {:.2f}'.format(np.std(review_len)))

print('문장 중간길이: {}'.format(np.median(review_len)))

print('제 1 사분위 길이: {}'.format(np.percentile(review_len, 25)))

print('제 3 사분위 길이: {}'.format(np.percentile(review_len, 75)))

plt.figure(figsize=(12,5))

plt.boxplot([review_len], labels=['token'], showmeans=True);

plt.figure(figsize=(12,5))

plt.boxplot([review_len_eumjeol], labels=['eumjeol'], showmeans=True);

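figsize sets the figure size.

A box plot summarizes numeric data using the minimum, first quartile, second quartile (median), third quartile, and maximum. With showmeans=True the mean is marked as well. By default matplotlib draws the whiskers to the furthest points within 1.5 times the interquartile range and plots anything beyond them as outliers.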
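The five numbers behind a box plot can be computed with the standard library alone. This sketch uses a made-up list of lengths; method="inclusive" interpolates linearly between data points, which to my understanding corresponds to the default behavior of np.percentile used in the notebook.

```python
import statistics

lengths = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical comment lengths

# Three cut points that split the data into four equal parts.
q1, q2, q3 = statistics.quantiles(lengths, n=4, method="inclusive")

summary = {
    "min": min(lengths),
    "q1": q1,
    "median": q2,
    "q3": q3,
    "max": max(lengths),
}
print(summary)
print("IQR:", q3 - q1)
```

The interquartile range (q3 minus q1) is the height of the box itself.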

stopwords = pd.read_csv('https://bab2min.tistory.com/attachment/cfile2.uf@241D6F475873C2B1010DEA.txt', sep='\t', header=None, names=['형태','품사','비율'])

stopwords_list = stopwords['형태'].tolist()

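Reads the tab-separated txt file into stopwords with read_csv, naming the columns 형태 (form), 품사 (part of speech), and 비율 (ratio).

Then the 형태 column is converted to a list and stored in stopwords_list.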

from wordcloud import WordCloud

%matplotlib inline

from matplotlib import font_manager

f_path = r"C:\Windows\Fonts\malgun.ttf"  # raw string so the backslashes are not treated as escapes

font_manager.FontProperties(fname=f_path).get_name()

from matplotlib import rc

rc('font', family='Malgun Gothic')

wordcloud = WordCloud(font_path=f_path, stopwords=stopwords_list, background_color='black', width=800, height=600).generate(' '.join(train['comments']))

plt.figure(figsize=(15,10))

plt.imshow(wordcloud)

plt.axis('off')

plt.show()

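A word cloud analyzes tags or words taken from the data and lays them out visually according to their importance. Tag clouds are traditionally arranged in alphabetical (or 가나다) order, with color and weight varying by importance. The WordCloud class used here instead scales each word by its frequency in the joined comments and places it wherever it fits.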

import seaborn as sns

label = train['hate_label'].value_counts()

fig, ax = plt.subplots(ncols=1)

fig.set_size_inches(6,3)

sns.countplot(x=train['hate_label']);

seaborn is a visualization package built on matplotlib that adds color themes and statistical charts.

value_counts() returns how many times each distinct value occurs in 'hate_label'.

countplot draws a bar for the number of rows in each category.

train_hate = pd.read_csv('./datas/train.hate.csv')

train_hate.rename(columns={'label': 'hate_label'}, inplace=True)

train_newstitle = pd.read_csv('./datas/train.news_title.txt', sep='\t', names=['news_title'])

train = pd.merge(train_hate, train_newstitle, left_index=True, right_index=True)

train.tail()

Reads train.hate.csv into train_hate and renames its label column to hate_label.

Reads the news titles into train_newstitle as a single news_title column.

Merges the two on their index into train, then shows the last five rows.

## fill NaN values

is_NaN = train.isnull()

row_has_NaN = is_NaN.any(axis=1)

rows_with_NaN = train[row_has_NaN]

rows_with_NaN

Finds the rows that contain NaN values and displays them.

train['news_title'].fillna("No Title", inplace=True)

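fillna fills NaN entries with the given value, and inplace=True applies the change to the existing object instead of returning a new one. On recent pandas versions, calling it on a selected column like this can emit a warning or leave the DataFrame unchanged, so assigning the result back to train['news_title'] is the safer form.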

## remove stopwords

stopwords = pd.read_csv('https://bab2min.tistory.com/attachment/cfile2.uf@241D6F475873C2B1010DEA.txt', sep='\t', header=None, names=['형태','품사','비율'])

stopwords_list = stopwords['형태'].tolist()

def process_text(text):

    clean_words = [word for word in text if word not in stopwords_list]

    return clean_words

train['comments'] = train['comments'].apply(process_text)

train['comments'] = [''.join(l) for l in train['comments']]

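This step removes stopwords.

The stopword file is read into stopwords and its 형태 column is converted to a list.

process_text keeps every element of text that is not in the stopword list. Because text is a string, the loop walks it character by character, so in practice only single-character stopwords are removed.

The function is applied to the comments column of train, and the resulting lists are joined back into strings.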
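The difference between filtering characters and filtering words is easy to see on a small example. The stopword entries and the comment below are made up, `process_text_chars` folds the notebook's filter and join into one function, and `process_text_words` is a hypothetical word-level variant, not code from the repository.

```python
stopwords_list = ["이", "그", "영화"]  # hypothetical sample entries

def process_text_chars(text):
    # Same logic as the notebook: iterating over a str yields characters.
    return "".join(ch for ch in text if ch not in stopwords_list)

def process_text_words(text):
    # Word-level variant: split on whitespace and drop stopword tokens.
    stopword_set = set(stopwords_list)
    return " ".join(w for w in text.split() if w not in stopword_set)

comment = "이 영화 그냥 이상해"
print(repr(process_text_chars(comment)))
print(repr(process_text_words(comment)))
```

The character-level version strips "이" and "그" out of the middle of other words and never matches the two-character stopword "영화". The word-level version removes only whole tokens.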

## remove punctuation

import string

## remove basic punctuation

def remove_punc(text):

    text_nopunc = "".join([char for char in text if char not in string.punctuation])

    return text_nopunc

train['comments'] = train['comments'].apply(lambda x: remove_punc(x))

train['news_title'] = train['news_title'].apply(lambda x: remove_punc(x))

This code removes basic punctuation: every character listed in string.punctuation is deleted from the comments and news_title columns.
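string.punctuation only contains ASCII punctuation, so this pass leaves other symbols in place. A short check with two invented strings:

```python
import string

def remove_punc(text):
    # Drop every character that appears in the ASCII punctuation set.
    return "".join(ch for ch in text if ch not in string.punctuation)

ascii_case = remove_punc("진짜?! (대박)...")
unicode_case = remove_punc("그래서… “정말”")

print(ascii_case)
print(unicode_case)
```

The ellipsis character and the curly quotes in the second string are not in string.punctuation and pass through unchanged.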

import re

# remove all punctuations except korean, english, and number

def cleanse(text):

    pattern = re.compile(r'\s+')

    text = re.sub(pattern, ' ', text)

    text = re.sub('[^가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z0-9]', ' ', text)

    return text

train['comments'] = train['comments'].apply(cleanse)

train['news_title'] = train['news_title'].apply(cleanse)

train.head()

train.to_csv('./datas/train_ver3')

Every character that is not Korean, English, or a digit is replaced with a space.

The preprocessed data is saved in CSV format under the name train_ver3.
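One detail of cleanse is the order of its two substitutions: whitespace is collapsed first and disallowed characters are replaced with spaces afterwards, so runs of spaces can reappear in the result. The sketch below compares that order with a swapped one; `cleanse_swapped_order` and the sample string are illustrations of my own, not part of the notebook.

```python
import re

DISALLOWED = re.compile(r"[^가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z0-9]")
WHITESPACE = re.compile(r"\s+")

def cleanse_original_order(text):
    # Same order as the notebook: collapse whitespace, then replace.
    text = WHITESPACE.sub(" ", text)
    return DISALLOWED.sub(" ", text)

def cleanse_swapped_order(text):
    # Replace disallowed characters first, then collapse and trim.
    text = DISALLOWED.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()

sample = "ㅋㅋㅋ 대박♥♥ good!!"
print(repr(cleanse_original_order(sample)))
print(repr(cleanse_swapped_order(sample)))
```

Whether the extra spaces matter depends on the tokenizer used later; whitespace splitting ignores them, but character-level length statistics would count them.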

Ugh, a single .ipynb file is way, way too long~!!

I can't do this, I just can't.
