Geonsik's Practice Room: Practice 7/30

transactions.csv  

id - see above
chain - see above
dept (major product category) - An aggregate grouping of the category (e.g. water)
category (minor product category) - The product category (e.g. sparkling water)
company - An id of the company that sells the item
brand - An id of the brand to which the item belongs
date - The date of purchase
product size - The amount of the product purchased (e.g. 16 oz of water)
product measure - The units of the product purchased (e.g. ounces)
purchase quantity - The number of units purchased
purchase amount - The dollar amount of the purchase

1. SELECT DISTINCT brand, pcategory FROM avsc.transaction_1 ORDER BY brand
2. SELECT DISTINCT dept, pcategory FROM avsc.transaction_2 ORDER BY dept

<Query results (partial)>
[Figures 1 and 2: partial result sets for queries 1 and 2]

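: The brand attribute is independent of the product's major or minor category. In contrast, although I did not examine the entire table, each minor category ID appears to be formed by prepending the ID of its major category (dept). In the first 3 million rows, the minor category (category) values ranged from 0 to 9999.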
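Both observations can be checked over whatever rows are loaded, rather than by eye. The two queries below are only a sketch: they reuse the table and column names from queries 1 and 2 above, and they assume MySQL and that the IDs can be compared as text after a cast.

```sql
-- Brands that appear under more than one minor category.
-- Many rows here would support the idea that brand is independent of category.
SELECT brand, COUNT(DISTINCT pcategory) AS n_categories
FROM avsc.transaction_1
GROUP BY brand
HAVING COUNT(DISTINCT pcategory) > 1
ORDER BY n_categories DESC;

-- (dept, pcategory) pairs whose category ID does NOT start with the dept ID.
-- An empty result is consistent with the prefix hypothesis for the rows scanned.
SELECT DISTINCT dept, pcategory
FROM avsc.transaction_2
WHERE CAST(pcategory AS CHAR) NOT LIKE CONCAT(CAST(dept AS CHAR), '%')
ORDER BY dept, pcategory;
```

An empty second result does not prove the rule for the whole data set. It only shows that no counterexample exists in the rows queried.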

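* From the data for a Kaggle example on evaluating the customer value of a retail store.
** This is the first time I have handled data with more than a million rows, and at 300 million rows the load into MySQL alone took two days. Jiho recommended MongoDB. One way or another, I need to switch to something that can process large data quickly.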
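If the rows were loaded with individual INSERT statements, MySQL's bulk loader may be worth trying before switching databases. The statement below is a rough sketch, not a description of how the data was actually loaded: the file path and the table name `avsc.transactions_raw` are hypothetical, and it assumes the table already exists with its columns in the same order as the CSV and that the file has one header line.

```sql
-- Bulk-load the CSV in a single statement (MySQL).
-- LOCAL reads the file from the client machine and requires local_infile to be enabled.
-- '/path/to/transactions.csv' and avsc.transactions_raw are placeholders.
LOAD DATA LOCAL INFILE '/path/to/transactions.csv'
INTO TABLE avsc.transactions_raw
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row
```

Creating secondary indexes after the load, rather than before it, usually shortens the load further, since the indexes are then built once instead of being updated on every row.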

Wednesday, July 30, 2014
