[Python] Frequency of the Most Used Words on a Web Page

코딩하는 백구 2020. 4. 13. 21:17

This post downloads the Wikipedia page for the term 'computer' and finds how often its most frequently used words appear.

- Logic: fetch the page / extract the body tag / strip HTML tags / split into words / count the words / sort by count
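
The tag-stripping and word-splitting steps above can be tried offline before running the full script. The following is a minimal sketch, not the script used in this post: `SAMPLE_HTML` and `extract_words` are hypothetical names, and the sample string stands in for the downloaded page. As a variation, it replaces each tag with a space instead of an empty string so that words in adjacent tags do not run together.

```python
import re

# Hypothetical sample input; the post itself downloads a Wikipedia page instead.
SAMPLE_HTML = (
    "<body><h1>Computer</h1>"
    "<p>A computer is a machine. Computers run programs.</p></body>"
)

def extract_words(html):
    """Strip HTML tags, then return the runs of ASCII letters and digits."""
    # Replace each tag with a space (assumption: simple, well-formed tags).
    text = re.sub(r'<.*?>', ' ', html)
    # Every maximal run of letters/digits counts as one word.
    return re.findall(r'[a-zA-Z0-9]+', text)

words = extract_words(SAMPLE_HTML)
print(words)
```

A regular expression is only a rough way to remove tags, so this sketch is meant for small, simple inputs.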

- Code

import urllib.request ## fetch the URL
import re ## strip HTML tags
from bs4 import BeautifulSoup

## Tag-removal function, based on https://www.fun25.co.kr/blog/python-remove-html-tag/?page=8
def remove_tag(content):
    cleanr = re.compile('<.*?>')  # non-greedy match of a single tag
    cleantext = re.sub(cleanr, '', content)
    return cleantext

## Set the URL
url = 'https://en.wikipedia.org/wiki/Computer'

## Load the page (tags included)
webpage = urllib.request.urlopen(url).read().decode('utf-8')

soup = BeautifulSoup(webpage, 'html.parser')  # parse for easier analysis

## Extract only the body tag, as a string
body = str(soup.find('body'))

## Remove HTML tags
print('Tags removed')
notag = remove_tag(body)
## print(notag)

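## Split into words: treat newlines and hyphens as spaces
splits = notag.replace('\n', ' ').replace('-', ' ').split(' ')

# New list that collects the cleaned words
list_par = []

for i in splits:
    # Replace every character that is not an English letter or digit with a space
    text = re.sub('[^a-zA-Z0-9]', ' ', i)

    # split() drops empty strings and separates pieces such as "it s",
    # so each entry in list_par is a single word
    list_par.extend(text.split())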

# Count the words
words_count = {}
for word in list_par:
    if word in words_count:
        words_count[word] += 1
    else:
        words_count[word] = 1

# Sort by count, most frequent first
sorted_words = sorted(words_count.items(), key=lambda word_count: -word_count[1])

# Print the word frequencies
print(sorted_words)
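
The counting and sorting loop above can also be written with `collections.Counter` from the standard library. This is a minimal sketch rather than the code used in this post: `sample_words` is a hypothetical list standing in for `list_par`, and the lowercasing step is an added assumption (the script above counts "Computer" and "computer" separately).

```python
from collections import Counter

# Hypothetical word list standing in for list_par from the script above.
sample_words = ["Computer", "a", "computer", "is", "a", "machine", "computer"]

# Lowercase first so "Computer" and "computer" are counted together.
counts = Counter(w.lower() for w in sample_words)

# most_common(n) returns (word, count) pairs, highest count first.
top = counts.most_common(2)
print(top)  # [('computer', 3), ('a', 2)]
```

Calling `most_common()` with no argument returns every word, which corresponds to the `sorted_words` list above.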

- Execution result