[ADP Practical Exam with R] 11. Text Mining: String Preprocessing, Korean (KoNLP), English (SnowballC), SNA

No Story, No Ecstasy

heave_17 2020. 12. 12. 17:25

0. String Preprocessing

- R code examples

# Convert to character
as.character(data)

# Concatenate strings
paste(string1, string2) # default sep = " "
paste0(string1, string2) # sep = "" (no separator)
paste(string1, string2, sep = "", collapse = ".") # collapse joins the resulting elements into a single string

# Count characters
nchar(x)

# Set relations between vectors
union(strings1, strings2) # union
intersect(strings1, strings2) # intersection

# Extract a substring
substr(string, start, stop)
substring(string, start) # from start to the end of the string

# Split
strsplit(string, split = " ") # split on the given pattern; returns a list

# Find and replace
grep(pattern, strings, value = FALSE) # returns matching indices; value = TRUE returns the matching strings
sub(pattern, replacement, strings, ignore.case = TRUE) # replaces only the first match in each string
gsub(pattern, replacement, strings, ignore.case = TRUE) # replaces every match
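To make the calls above concrete, here is a small sketch on made-up strings. The `fruits` vector and every variable name are hypothetical and not from the original post; only base R is used.

```r
# Hypothetical sample data (not from the original post)
fruits <- c("apple pie", "banana split", "cherry tart")

# collapse turns the whole vector into one string
joined <- paste(fruits, collapse = ", ")

# Number of characters in each element
n_chars <- nchar(fruits)

# strsplit returns a list, so take the first piece of each element
first_words <- sapply(strsplit(fruits, split = " "), `[`, 1)

# value = TRUE returns the matching strings instead of their indices
with_an <- grep("an", fruits, value = TRUE)

# sub replaces the first match per string, gsub replaces all of them
first_only  <- sub("a", "A", fruits)
all_matches <- gsub("a", "A", fruits)

print(joined)
print(first_only)
print(all_matches)
```

Comparing `first_only` with `all_matches` on "banana split" shows the difference between `sub` and `gsub` most clearly.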

- R regex reference

[:alnum:] Alphanumeric characters: [:alpha:] and [:digit:]

[:alpha:] Alphabetic characters: [:lower:] and [:upper:]

[:blank:] Blank characters: space and tab, and possibly other locale-dependent characters

[:cntrl:] Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL).

[:graph:] Graphical characters: [:alnum:] and [:punct:]

[:lower:] Lower-case letters in the current locale

[:print:] Printable characters: [:alnum:], [:punct:] and space

[:punct:] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly others

[:upper:] Upper-case letters in the current locale

[:xdigit:] Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

? The preceding item is optional and will be matched at most once.

* The preceding item will be matched zero or more times.

+ The preceding item will be matched one or more times.

{n} The preceding item is matched exactly n times.

{n,} The preceding item is matched n or more times.

{n,m} The preceding item is matched at least n times, but not more than m times.
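A minimal sketch of how the character classes and quantifiers listed above behave in base R. Note that a class such as `[:punct:]` only works inside a bracket expression, so it is written `[[:punct:]]`. The sample strings and variable names are hypothetical.

```r
# Hypothetical sample string (not from the original post)
x <- "Hello,   World! Call 010-1234-5678."

# Remove every punctuation character
no_punct <- gsub("[[:punct:]]", "", x)

# "+" collapses each run of whitespace into a single space
one_space <- gsub("[[:space:]]+", " ", x)

# "^" inside the brackets negates the class: drop everything except digits
digits_only <- gsub("[^[:digit:]]", "", x)

# {n} requires exactly n repetitions of the preceding item
is_phone <- grepl("^[[:digit:]]{3}-[[:digit:]]{4}-[[:digit:]]{4}$",
                  c("010-1234-5678", "10-123-5678"))

# "?" makes the preceding item optional
optional_u <- grepl("colou?r", c("color", "colour", "colr"))

print(no_punct)
print(one_space)
print(digits_only)
```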

1. Korean (KoNLP)

- R code examples

# 0. Load packages
library(tm)
library(rJava)
library(KoNLP)
library(wordcloud)
library(dplyr)
library(ggplot2)

# 1. Build the dictionary
##  - extractNoun(text): extract nouns
##  - SimplePos22(text): morphological analysis
useSejongDic()
buildDictionary(ext_dic = "woorimalsam", user_dic = data.frame(dic_data, "ncn"), replace_usr_dic = TRUE)
#buildDictionary(ext_dic, data) # data: a data.frame or file containing the words to add

# 2. Preprocess the raw data
## - tm_map can also be used on a corpus > tm_map(x, content_transformer(tolower))
## - Applicable functions: tolower, stemDocument, stripWhitespace, removePunctuation, removeNumbers, removeWords, PlainTextDocument
clean_txt = function(x) {
    x = tolower(x)
    x = removePunctuation(x)
    x = removeNumbers(x)
    x = stripWhitespace(x)
    return(x)
}
text_data = clean_txt(text_data)
## gsub can also be used to strip any other unwanted pattern
text = gsub("[[:punct:]]", "", text) # punct: punctuation / digit: digits / A-Za-z: alphabet letters / alnum: letters and digits

# 3. Create the corpus (Corpus: a data structure that can be fed directly to the algorithms)
text_corpus = VCorpus(VectorSource(text_data))

# 4. Create the TDM (term-document matrix)
tdm = TermDocumentMatrix(text_corpus, control = list(dictionary = dic_data))

# 5. Analyses
## 5.1. Plot the most frequent words
tdm1 = as.matrix(tdm)
tdm1 = sort(rowSums(tdm1), decreasing = TRUE)
tdm1 = data.frame(word = names(tdm1), freq = tdm1)
tdm1 = tdm1 %>% mutate(word = factor(word, levels = word)) # keep the bars in frequency order
ggplot(tdm1[1:10,], aes(x = word, y = freq)) + geom_bar(stat = "identity")

## 5.2. Extract nouns only and draw a wordcloud
wordc = sapply(text_data, extractNoun)
nouns = as.vector(unlist(wordc))
nouns = nouns[nchar(nouns) > 1] # drop single-character nouns
nouns_wc = data.frame(sort(table(nouns), decreasing = TRUE)) # columns: nouns, Freq
wordcloud(nouns_wc$nouns, nouns_wc$Freq, min.freq = 10, colors = brewer.pal(8, "Dark2"))

## 5.3. Association analysis
findAssocs(tdm, c("interested_word1", "interested_word2"), corlimit = 0.7) # pass the TDM, not the frequency data.frame

## + Morphological analysis
###  - Common noun: NC / Proper noun: NQ / Verb: PV / Adjective: PA / Adverb: MA
library(stringr)
# paste: Concatenate vectors after converting to character.
doc = as.character(text$Content)
pos = paste(SimplePos09(doc)) # SimplePos09: attaches 9 POS tags / SimplePos22: attaches 22 POS tags
extracted = grep(pattern = "P", x = pos, value = TRUE) # keep the elements that contain "P"
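Step 5.2 above turns a noun vector into the frequency table that `wordcloud` consumes. The sketch below isolates that step in base R so it can be tried without KoNLP or Java. The `nouns` vector is a hypothetical stand-in for the output of `extractNoun`, not data from the original post.

```r
# Hypothetical nouns, standing in for as.vector(unlist(sapply(text_data, extractNoun)))
nouns <- c("데이터", "분석", "데이터", "텍스트", "분석", "데이터", "R")

# Drop single-character tokens, as in step 5.2
nouns <- nouns[nchar(nouns) > 1]

# table() counts each noun; sort() orders the counts from most to least frequent.
# Converting the table to a data.frame yields the columns `nouns` and `Freq`.
nouns_wc <- data.frame(sort(table(nouns), decreasing = TRUE))

print(nouns_wc)
```

The column names come from the variable passed to `table()`, which is why the wordcloud call refers to `nouns_wc$nouns` and `nouns_wc$Freq`.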

2. English (SnowballC)

- R code examples

# 0. Load packages
library(tm)
library(SnowballC) # stemmer used by tm's stemDocument
library(wordcloud)
library(RColorBrewer)
library(dplyr)   # %>% and mutate in step 5.1
library(ggplot2) # ggplot in step 5.1

# 1. Build the dictionary
## useSejongDic() and buildDictionary() are KoNLP functions for Korean, so this step is skipped for English

# 2. Create the corpus (Corpus: a data structure that can be fed directly to the algorithms)
text_corpus = VCorpus(VectorSource(text_data))
inspect(text_corpus)

# 3. Preprocess the raw data
toSpace = content_transformer(function(x, pattern) gsub(pattern, " ", x)) # replace a pattern with a space
text_corpus = tm_map(text_corpus, toSpace, "/")
text_corpus = tm_map(text_corpus, toSpace, "@")
text_corpus = tm_map(text_corpus, toSpace, "\\|")

text_corpus = tm_map(text_corpus, content_transformer(tolower))
text_corpus = tm_map(text_corpus, removeNumbers)
text_corpus = tm_map(text_corpus, removeWords, stopwords("english")) # remove common stop words
text_corpus = tm_map(text_corpus, removeWords, c("asedf", "asdf")) # remove your own stop words
text_corpus = tm_map(text_corpus, removePunctuation)
text_corpus = tm_map(text_corpus, stripWhitespace)

# 4. Create the TDM (term-document matrix)
tdm = TermDocumentMatrix(text_corpus, control = list(dictionary = dic_data)) # dictionary: keep only the terms listed in dic_data
tdm1 = as.matrix(tdm)
tdm1 = sort(rowSums(tdm1), decreasing = TRUE)
tdm1 = data.frame(word = names(tdm1), freq = tdm1)

# 5. Analyses
## 5.0. List the terms that occur at least a minimum number of times
findFreqTerms(tdm, lowfreq = 5) # pass the TDM, not the frequency data.frame

## 5.1. Plot the most frequent words
tdm1 = tdm1 %>% mutate(word = factor(word, levels = word)) # keep the bars in frequency order
ggplot(tdm1[1:10,], aes(x = word, y = freq)) + geom_bar(stat = "identity")

## 5.2. Draw a wordcloud
### random.order = FALSE plots the words in decreasing order of frequency
wordcloud(words = tdm1$word, freq = tdm1$freq, min.freq = 10, max.words = 20, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

## 5.3. Association analysis
findAssocs(tdm, c("interested_word1", "interested_word2"), corlimit = 0.7)

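For intuition about what steps 4 and 5 compute, here is a base R sketch that builds a tiny term-document matrix by hand and then correlates one term with the others. This is an illustration under stated assumptions, not tm's implementation: the toy `docs` are hypothetical, and the correlation step only approximates the idea behind `findAssocs`, which reports terms whose correlation with a given term is at least `corlimit`.

```r
# Hypothetical toy documents (not from the original post)
docs <- c(d1 = "text mining with r",
          d2 = "mining text data",
          d3 = "r for data analysis",
          d4 = "text mining tools")

# Tokenize on whitespace and collect the vocabulary
tokens <- strsplit(tolower(docs), "[[:space:]]+")
terms  <- sort(unique(unlist(tokens)))

# Rows = terms, columns = documents, cells = counts
tdm <- vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms
colnames(tdm) <- names(docs)

# Total frequency per term, as in rowSums(as.matrix(tdm))
term_freq <- sort(rowSums(tdm), decreasing = TRUE)

# Correlation of "text" with every other term across documents
assoc  <- cor(t(tdm))["text", ]
assoc  <- sort(assoc[names(assoc) != "text"], decreasing = TRUE)
strong <- assoc[assoc >= 0.7]

print(tdm)
print(term_freq)
print(strong)
```

In this toy corpus "mining" appears in exactly the same documents as "text", so it is the only term that clears the 0.7 threshold.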

3. SNA (Social Network Analysis)

- Well-organized reference links

- kuduz.tistory.com/1087

Social network analysis in R, written as kindly as possible (feat. tidygraph, ggraph)

A social network is a sociological concept that views the relationships among individuals, groups, and society as a network. That is, each individual or group is a node in the network,

- apple-rbox.tistory.com/11?category=1040726

R and Network Analysis (1)

R and network analysis. This post covers how to build and analyze data for social network analysis in R. It also covers the D3-based package (networkD3), which makes data visualization efficient
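As a rough illustration of the node-and-relationship idea described in the links above, the sketch below builds an adjacency matrix from an edge list and counts each node's connections (degree) in base R. The edge list is hypothetical, and the linked posts use dedicated packages (tidygraph, ggraph, networkD3) rather than this hand-rolled approach.

```r
# Hypothetical undirected edge list: each row is one relationship
edges <- data.frame(from = c("A", "A", "B", "C"),
                    to   = c("B", "C", "C", "D"),
                    stringsAsFactors = FALSE)

# Every distinct endpoint is a node
nodes <- sort(unique(c(edges$from, edges$to)))

# Adjacency matrix: 1 where two nodes are connected, 0 otherwise
adj <- matrix(0L, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1L
adj[cbind(edges$to, edges$from)] <- 1L # mirror the edges because the network is undirected

# Degree centrality: number of neighbors per node
degree <- rowSums(adj)

print(adj)
print(degree)
```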

https://images.app.goo.gl/Lq1cYgmTLTTnZjhbA
